
Papers Explained 443: Context Rot

By Ritvik Rastogi
September 3, 2025

LLMs are typically presumed to process context uniformly: because they achieve near-perfect scores on widely adopted benchmarks like Needle in a Haystack (NIAH), their performance is often assumed to hold up equally well across long-context tasks. In practice, this assumption does not hold; model performance varies significantly as input length changes, even on simple tasks.

However, NIAH is fundamentally a simple retrieval task, in which a known sentence (the “needle”) is placed in a long document of unrelated text (the “haystack”), and the model is prompted to retrieve it. While scalable, this benchmark typically assesses direct lexical matching, which may not be representative of flexible, semantically oriented tasks.

This report evaluates 18 LLMs, including the state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models. The results reveal that models do not use their context uniformly; instead, their performance becomes increasingly unreliable as input length grows.

Needle in a Haystack Extension

The classic Needle in a Haystack task involves placing a random fact (the ‘needle’) in the middle of a long context window (the ‘haystack’), then asking the model about that fact.

The original implementation of this task uses a needle-question pair with lexical matches. However, usage of long context in practice often requires semantic understanding of ambiguous tasks.
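To make the setup concrete, below is a minimal sketch of how such a test case can be assembled. The helper and question are illustrative; the needle shown is the one popularized by the original NIAH benchmark.

```python
# Minimal sketch of a classic NIAH test case (illustrative, not the benchmark's code).
def build_niah_prompt(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of the haystack."""
    sentences = haystack.split(". ")
    position = int(len(sentences) * depth)
    sentences.insert(position, needle)
    context = ". ".join(sentences)
    return (
        f"{context}\n\n"
        "Answer the question based only on the context above.\n"
        "Question: What is the best thing to do in San Francisco?"
    )

haystack_text = "..."  # placeholder: e.g. concatenated Paul Graham essays
needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
prompt = build_niah_prompt(haystack_text, needle, depth=0.5)
```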

[Figure: Example Needle in a Haystack (NIAH) setup]

NoLiMa demonstrated that non-lexical matching becomes challenging for models as context length increases; its needle-question pairs require models to infer latent associations.

However, the impact of non-lexical matching in isolation remains underexplored. Furthermore, a binary distinction between “lexical” and “non-lexical” oversimplifies the complexity of question answering in real-world scenarios: needle-question pairs exist on a spectrum of similarity, yet all are classified under these two broad categories.

Models often have to deal with distractors as well, which has been shown to degrade performance.

[Figure: Comparison of distractors vs. irrelevant context]
  • Distractors are topically related to the needle, but do not quite answer the question
  • Irrelevant content is unrelated to the needle and question

Another underexplored aspect of NIAH is the haystack itself. It is often treated simply as a means of scaling input length, which assumes that the haystack content has no effect on task performance. If the model were indeed insensitive to the haystack's content, then varying that content, for example its topic or narrative flow, should have no influence on the results. This assumption, however, remains largely untested.

Four controlled experiments are designed to investigate the influence of these factors:

Needle-Question Similarity

Cosine similarity between needle-question pairs is computed using embeddings. For robustness, scores are averaged across five embedding models: text-embedding-3-small, text-embedding-3-large, jina-embeddings-v3, voyage-3-large, and all-MiniLM-L6-v2. Model performance is then measured as input length increases, stratified by needle-question similarity.
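A sketch of this measurement, assuming sentence-transformers for the one locally runnable model; the API-based embedders (text-embedding-3-small/-large, jina-embeddings-v3, voyage-3-large) would be wrapped the same way and appended to the list:

```python
# Sketch: needle-question similarity averaged over several embedding models.
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_minilm(texts: list[str]) -> np.ndarray:
    return SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

def needle_question_similarity(needle: str, question: str, embedders) -> float:
    """Mean cosine similarity across embedding models, as described above."""
    sims = []
    for embed in embedders:
        needle_vec, question_vec = embed([needle, question])
        sims.append(cosine(needle_vec, question_vec))
    return float(np.mean(sims))

score = needle_question_similarity(
    "I advise rereading a draft aloud before publishing.",  # hypothetical needle
    "What is the best writing tip?",                        # hypothetical question
    embedders=[embed_minilm],  # add wrappers for the other four models here
)
```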

Impact of Distractors

Starting from a high-similarity needle-question pair, four distractors are written. The following setups are used:

  • Baseline: needle only, no distractors
  • Single distractor: needle + one randomly positioned distractor
  • Multiple distractors: needle + all four distractors randomly positioned

The impact of distractors on model performance is tested as input length increases, to measure whether degradation is uniform across distractors and input lengths.
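A minimal sketch of assembling the three conditions; the helper names are hypothetical, and sentence-level insertion is an assumed granularity:

```python
# Sketch: the three distractor conditions described above.
import random

def insert_randomly(sentences: list[str], item: str) -> list[str]:
    """Return a copy of `sentences` with `item` inserted at a random position."""
    out = sentences[:]
    out.insert(random.randrange(len(out) + 1), item)
    return out

def build_conditions(haystack: list[str], needle: str, distractors: list[str]) -> dict:
    with_needle = insert_randomly(haystack, needle)
    multiple = with_needle
    for d in distractors:
        multiple = insert_randomly(multiple, d)
    return {
        "baseline": with_needle,  # needle only, no distractors
        "single_distractor": insert_randomly(with_needle, random.choice(distractors)),
        "multiple_distractors": multiple,  # needle + all four distractors
    }
```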

Needle-Haystack Similarity

Two thematically distinct haystacks, Paul Graham essays and arXiv papers, are used. Corresponding needles are written for each. To measure needle-haystack similarity, the haystack is embedded and the top-5 chunks for each needle are retrieved. The average cosine similarity scores are then calculated. This process is repeated across five different embedding models for robustness.
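A sketch of this retrieval-style score, assuming a generic embed function (for example the MiniLM wrapper from the earlier sketch); the report repeats this across five embedders and averages:

```python
# Sketch: needle-haystack similarity = mean cosine similarity of the
# top-5 haystack chunks retrieved for a given needle.
import numpy as np

def needle_haystack_similarity(needle: str, chunks: list[str], embed, k: int = 5) -> float:
    chunk_vecs = embed(chunks)                # (num_chunks, dim)
    needle_vec = embed([needle])[0]           # (dim,)
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    needle_vec = needle_vec / np.linalg.norm(needle_vec)
    sims = chunk_vecs @ needle_vec            # cosine similarity per chunk
    return float(np.sort(sims)[-k:].mean())  # average over the top-k chunks
```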

Haystack Structure

In typical NIAH setups, haystacks are concatenations of coherent texts, each with its own logical flow of ideas. For instance, the original NIAH benchmark uses a series of Paul Graham essays, each of which follows a structured organization of ideas to form an argument. To evaluate whether this structure influences model performance, two conditions are compared (a sketch of the shuffled condition follows the list):

  • Original: preserves the natural flow of ideas within each excerpt
  • Shuffled: sentences are randomly reordered throughout the haystack, preserving the overall topic while removing logical continuity
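A minimal sketch of producing the shuffled condition; the naive sentence splitting is an assumption, not the report's exact procedure:

```python
# Sketch: the "shuffled" condition -- same sentences and topic, no logical flow.
import random

def shuffle_haystack(haystack: str, seed: int = 0) -> str:
    sentences = [s.strip() for s in haystack.split(". ") if s.strip()]
    random.Random(seed).shuffle(sentences)  # deterministic shuffle for reproducibility
    return ". ".join(sentences)
```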

Details

For every unique combination of needle type, haystack topic, and haystack structure, models are tested across (the full grid is sketched after this list):

  • 8 input lengths
  • 11 needle positions
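The grid can be enumerated as below; the concrete token lengths and depth values are assumptions, since the report does not list them here:

```python
# Sketch: enumerating every experimental condition in the grid.
from itertools import product

INPUT_LENGTHS = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000]  # tokens; assumed values
NEEDLE_DEPTHS = [i / 10 for i in range(11)]  # 11 relative positions: 0.0, 0.1, ..., 1.0

def test_grid(needle_types, haystack_topics, structures):
    """Yield every (needle, topic, structure, length, depth) combination."""
    yield from product(needle_types, haystack_topics, structures, INPUT_LENGTHS, NEEDLE_DEPTHS)

conditions = list(test_grid(
    needle_types=["high_similarity", "low_similarity"],  # hypothetical labels
    haystack_topics=["pg_essays", "arxiv"],
    structures=["original", "shuffled"],
))
```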

Models are evaluated across their maximum context windows with temperature=0, unless that setting is incompatible (e.g. o3) or explicitly discouraged (e.g. Qwen's “thinking mode”). For Qwen models, the YaRN method is applied to extend the context window from 32,768 to 131,072 tokens.

Models are included in both standard and “thinking mode” where applicable. Model outputs are evaluated using an aligned GPT-4.1 judge.
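A sketch of what such a judge call can look like with the OpenAI Python client; the prompt and the exact correctness criterion are illustrative, not the report's aligned judge:

```python
# Sketch: LLM-as-judge scoring of a model output against a reference answer.
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, model_output: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Does the answer convey the same information as the reference?\n"
                f"Question: {question}\n"
                f"Reference: {reference}\n"
                f"Answer: {model_output}\n"
                "Reply with exactly 'correct' or 'incorrect'."
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower() == "correct"
```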

Needle-Question Similarity

The experiment uses two domains for the haystack content: Paul Graham (PG) essays and arXiv papers.

For each haystack topic, common themes were identified to guide question and needle writing: documents were chunked, the chunks embedded, UMAP applied for dimensionality reduction, and HDBSCAN for clustering. Representative chunks from the largest clusters were examined manually to determine their themes and style. For PG essays, writing advice emerged as a common theme; for arXiv papers, it was information retrieval, specifically re-ranking.
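A sketch of this theme-discovery pipeline using the umap-learn and hdbscan packages; the chunking, embedding model, and hyperparameters are assumptions:

```python
# Sketch: chunk -> embed -> UMAP -> HDBSCAN theme discovery.
import hdbscan
import umap
from sentence_transformers import SentenceTransformer

def discover_themes(chunks: list[str]):
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)
    reducer = umap.UMAP(n_components=5, metric="cosine")
    reduced = reducer.fit_transform(embeddings)
    # prediction_data=True enables approximate_predict for new points later.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(reduced)
    return reducer, clusterer  # inspect chunks in the largest clusters manually
```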

Corresponding questions were written for each topic. Before writing the needles, it was verified that answers to these questions did not exist in the haystack content by querying a vector database of haystack chunk embeddings and manually examining the top-10 results.

For each question, eight needles were written, each verified to belong to the dominant cluster via the clusterer's approximate predictions; needles assigned to the writing/retrieval cluster with >0.9 probability were considered to blend topically into the haystack. The needles were written manually to avoid data contamination, and their level of ambiguity was varied, quantified as the cosine similarity between needle and question embeddings averaged across the same five embedding models. For the PG essays topic, needle-question similarity ranged from 0.445 to 0.775; for the arXiv topic, from 0.521 to 0.829.
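Assuming “approximate predictions” refers to HDBSCAN's approximate_predict (which needs prediction_data=True at fit time, as in the earlier sketch), the membership check might look like:

```python
# Sketch: verify a candidate needle lands in the target cluster with >0.9 probability.
import hdbscan

def needle_in_cluster(clusterer, reducer, embed, needle: str, target_label: int) -> bool:
    point = reducer.transform(embed([needle]))  # project into the cluster space
    labels, strengths = hdbscan.approximate_predict(clusterer, point)
    return labels[0] == target_label and strengths[0] > 0.9
```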


