LLMs are typically presumed to process context uniformly: because they achieve near-perfect scores on widely adopted benchmarks like Needle in a Haystack (NIAH), it is often assumed that their performance holds steady across long-context tasks. In practice, this assumption does not hold. Model performance varies significantly as input length changes, even on simple tasks.
However, NIAH is fundamentally a simple retrieval task, in which a known sentence (the “needle”) is placed in a long document of unrelated text (the “haystack”), and the model is prompted to retrieve it. While scalable, this benchmark typically assesses direct lexical matching, which may not be representative of flexible, semantically oriented tasks.
This report evaluates 18 LLMs, including the state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models. Results reveal that models do not use their context uniformly; instead, their performance becomes increasingly unreliable as input length grows.
Needle in a Haystack Extension
The classic Needle in a Haystack task involves placing a random fact (the “needle”) in the middle of a long context window (the “haystack”), then asking the model about that fact.
The original implementation of this task uses a needle-question pair with direct lexical matches. However, long-context usage in practice often requires semantic understanding of ambiguous tasks.
NoLiMa has demonstrated that non-lexical matching becomes a challenge for models as context length increases; its needle-question pairs require models to infer latent associations rather than rely on surface overlap.
Testing the impact of non-lexical matching in isolation remains underexplored. Furthermore, the binary distinction between “lexical” and “non-lexical” oversimplifies the complexity of question-answering in real-world scenarios: needle-question pairs exist on a spectrum of similarity, yet they are all collapsed into these two broad categories.
Models often have to deal with distractors as well, which have been shown to degrade performance. Distractors are distinct from merely irrelevant content:
- Distractors are topically related to the needle, but do not quite answer the question
- Irrelevant content is unrelated to the needle and question
Another underexplored aspect of NIAH is the haystack itself. It is often treated simply as a means of scaling input length, which assumes that the haystack content has no effect on task performance. If the model is indeed insensitive to the content of the haystack, then varying this content, for example the haystack’s topic or narrative flow, should have no influence on the results. However, this assumption remains largely untested.
Four controlled experiments are designed to investigate the influence of these factors:
Needle-Question Similarity
Cosine similarity between needle-question pairs is computed using embeddings. For robustness, the average across five embedding models is taken: text-embedding-3-small, text-embedding-3-large, jina-embeddings-v3, voyage-3-large, and all-MiniLM-L6-v2. Model performance is then measured as input length increases, as a function of needle-question similarity.
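As a rough sketch, the per-pair score is the cosine similarity between needle and question embeddings, averaged over the models. The snippet below uses only all-MiniLM-L6-v2 via sentence-transformers; the four API-based models would be queried through their own clients and averaged the same way, and the needle/question strings are placeholders.

```python
# Sketch: needle-question cosine similarity, averaged across embedding models.
import numpy as np
from sentence_transformers import SentenceTransformer

def needle_question_similarity(needle: str, question: str, models) -> float:
    """Average cosine similarity between a needle and a question across models."""
    scores = []
    for model in models:
        needle_emb, question_emb = model.encode([needle, question])
        cos = np.dot(needle_emb, question_emb) / (
            np.linalg.norm(needle_emb) * np.linalg.norm(question_emb)
        )
        scores.append(float(cos))
    return float(np.mean(scores))

# Only the local model is instantiated here; the other four are API-based.
models = [SentenceTransformer("all-MiniLM-L6-v2")]
print(needle_question_similarity("Placeholder needle sentence.", "Placeholder question?", models))
```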
Impact of Distractors
Starting from a high-similarity needle-question pair, four distractors are written. The following setups are then compared:
- Baseline: needle only, no distractors
- Single distractor: needle + one randomly positioned distractor
- Multiple distractors: needle + all four distractors randomly positioned
The impact of distractors on model performance is then measured as input length increases, testing whether degradation is uniform across distractors and input lengths.
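A minimal sketch of how the three conditions might be assembled, assuming the needle and distractors are spliced in at random points between haystack sentences; all texts below are placeholders rather than the actual needle or distractors used in the experiments.

```python
# Sketch: assemble the baseline, single-distractor, and multiple-distractor conditions.
import random

haystack_sentences = ["First haystack sentence.", "Second sentence.", "Third sentence."]  # placeholder
needle = "Placeholder needle that answers the question."
distractors = [
    "Placeholder distractor 1 (topically related, but does not answer the question).",
    "Placeholder distractor 2.",
    "Placeholder distractor 3.",
    "Placeholder distractor 4.",
]

def build_context(sentences, needle, chosen_distractors, seed=0):
    """Insert the needle and each chosen distractor at independent random positions."""
    rng = random.Random(seed)
    out = list(sentences)
    for snippet in [needle] + list(chosen_distractors):
        out.insert(rng.randint(0, len(out)), snippet)
    return " ".join(out)

conditions = {
    "baseline": [],                        # needle only, no distractors
    "single_distractor": distractors[:1],  # needle + one randomly positioned distractor
    "multiple_distractors": distractors,   # needle + all four distractors
}
contexts = {name: build_context(haystack_sentences, needle, d) for name, d in conditions.items()}
```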
Needle-Haystack Similarity
Two thematically distinct haystacks, Paul Graham essays and arXiv papers, are used, and corresponding needles are written for each. To measure needle-haystack similarity, the haystack is embedded, the top-5 chunks for each needle are retrieved, and the average cosine similarity between the needle and those chunks is calculated. This process is repeated across five different embedding models for robustness.
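A sketch of this measurement, assuming fixed-size character chunking and a single local embedding model; the chunk size is an assumption, and the report repeats the computation across five embedding models and averages the results.

```python
# Sketch: mean cosine similarity between a needle and its top-5 closest haystack chunks.
import numpy as np
from sentence_transformers import SentenceTransformer

def needle_haystack_similarity(needle: str, haystack: str, model,
                               chunk_size: int = 1000, top_k: int = 5) -> float:
    chunks = [haystack[i:i + chunk_size] for i in range(0, len(haystack), chunk_size)]
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    needle_emb = model.encode([needle], normalize_embeddings=True)[0]
    sims = chunk_embs @ needle_emb          # cosine similarities (embeddings are unit-normalized)
    return float(np.sort(sims)[-top_k:].mean())

model = SentenceTransformer("all-MiniLM-L6-v2")
```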
Haystack Structure
In typical NIAH setups, haystacks are concatenations of coherent texts, each with their own logical flow of ideas. For instance, the original NIAH benchmark uses a series of Paul Graham essays, where each essay follows a structured organization of ideas to form an argument. To evaluate whether this structure influences model performance, two conditions are compared:
- Original: preserves the natural flow of ideas within each excerpt
- Shuffled: sentences are randomly reordered throughout the haystack, so the overall topic is maintained but logical continuity is removed
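A minimal sketch of the shuffled condition, assuming a simple regex-based sentence split; the original condition leaves each excerpt's sentence order untouched.

```python
# Sketch: shuffle sentence order while keeping the overall topic intact.
import random
import re

def shuffle_haystack(haystack: str, seed: int = 0) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", haystack.strip())
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)
```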
Details
For every unique combination of needle type, haystack topic, and haystack structure, models are tested across:
- 8 input lengths
- 11 needle positions
Models are evaluated across their maximum context window with temperature=0 unless that setting is incompatible (e.g. o3) or explicitly discouraged (e.g. Qwen’s “thinking mode”). For Qwen models, the YaRN method is applied to extend the context window from 32,768 to 131,072 tokens.
Models are included in both standard and “thinking mode” where applicable. Model outputs are evaluated using an aligned GPT-4.1 judge.
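For concreteness, the test grid can be thought of as every (input length, needle depth) pair, with the needle spliced into a truncated haystack at a given relative depth. The token budgets below are placeholders and the evenly spaced depths are an assumption; only the 8 × 11 grid shape comes from the setup above.

```python
# Sketch: enumerate the evaluation grid and splice the needle at a relative depth.
from itertools import product

input_lengths = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000]  # placeholder token budgets
needle_depths = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0 (assumed spacing)

def insert_needle(haystack_tokens, needle_tokens, length, depth):
    """Truncate the haystack to `length` tokens and insert the needle at `depth`."""
    window = list(haystack_tokens)[: max(length - len(needle_tokens), 0)]
    cut = int(len(window) * depth)
    return window[:cut] + list(needle_tokens) + window[cut:]

trials = list(product(input_lengths, needle_depths))  # 8 x 11 = 88 conditions per combination
```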
Needle-Question Similarity
The experiment uses two domains for the haystack content: Paul Graham (PG) essays and arXiv papers.
For each haystack topic, common themes were identified to guide question and needle writing. This involved chunking documents, embedding the chunks, reducing dimensionality with UMAP, and clustering with HDBSCAN. Representative chunks from the largest clusters were examined manually to determine their themes and style. For PG essays, writing advice was identified as a common theme; for arXiv papers, it was information retrieval, specifically re-ranking.
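A sketch of that pipeline, with assumed parameter values (all-MiniLM-L6-v2 embeddings, 5 UMAP components, HDBSCAN minimum cluster size of 10); the report does not specify these settings. The fitted objects are reused in the membership check sketched further below.

```python
# Sketch: embed chunks, reduce with UMAP, cluster with HDBSCAN.
import hdbscan
import umap
from sentence_transformers import SentenceTransformer

def cluster_chunks(chunks):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks, normalize_embeddings=True)
    reducer = umap.UMAP(n_components=5, metric="cosine")            # assumed settings
    reduced = reducer.fit_transform(embeddings)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
    labels = clusterer.fit_predict(reduced)
    return model, reducer, clusterer, labels
```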
Corresponding questions were written for each topic. Before writing the needles, it was verified that answers to these questions did not exist in the haystack content by querying a vector database of haystack chunk embeddings and manually examining the top-10 results.
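A sketch of that contamination check, assuming a Chroma collection as the vector store (any vector database would serve); `chunks` and `question` are placeholders for the haystack chunks and a written question, and the retrieved results are inspected manually.

```python
# Sketch: retrieve the top-10 haystack chunks for a question and inspect them manually.
import chromadb

chunks = ["placeholder haystack chunk"]   # haystack chunk texts
question = "placeholder question"         # a question written for the topic

client = chromadb.Client()
collection = client.get_or_create_collection(name="haystack_chunks")
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

results = collection.query(query_texts=[question], n_results=10)
for rank, doc in enumerate(results["documents"][0], start=1):
    print(rank, doc[:120])  # does any retrieved chunk already answer the question?
```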
For each question, eight needles were manually written to avoid data contamination. Each needle was verified to belong to the relevant large cluster using the clusterer’s approximate membership predictions; needles assigned to the writing or retrieval cluster with >0.9 probability were considered to topically blend into the haystack. The level of ambiguity was varied across the eight needles and measured as the needle-question cosine similarity averaged over the same five embedding models. For the PG essays topic, needle-question similarity ranged from 0.445 to 0.775; for the arXiv topic, it ranged from 0.521 to 0.829.
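Assuming the approximate membership predictions refer to hdbscan’s approximate_predict (which requires prediction_data=True at fit time, as in the clustering sketch above), the check for a candidate needle might look as follows, reusing `model`, `reducer`, and `clusterer` from that sketch.

```python
# Sketch: verify a candidate needle falls in the target cluster with >0.9 probability.
import hdbscan

def blends_into_cluster(needle, target_label, model, reducer, clusterer, threshold=0.9):
    emb = model.encode([needle], normalize_embeddings=True)
    reduced = reducer.transform(emb)
    labels, probs = hdbscan.approximate_predict(clusterer, reduced)
    return labels[0] == target_label and probs[0] > threshold
```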