Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale
- URL: http://arxiv.org/abs/2601.20276v1
- Date: Wed, 28 Jan 2026 05:44:00 GMT
- Title: Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale
- Authors: Tianwei Lin, Zuyi Zhou, Xinda Zhao, Chenke Wang, Xiaohong Li, Yu Chen, Chuanrui Hu, Jian Pei, Yafeng Deng
- Abstract summary: We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model's context window.
- Score: 18.13756357502514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting. This enables consistent diagnosis for both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.
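The decoupled protocol described above separates "did the system find the right documents?" from "did it answer correctly?". Below is a minimal sketch of how such a scorer could look; the field names (gold_doc_ids, cited_doc_ids, gold_answers) and the exact-match answer check are illustrative assumptions, not EMB-S's actual schema or judging procedure.

```python
from dataclasses import dataclass

@dataclass
class Example:
    gold_doc_ids: set    # IDs of the documents holding the gold evidence (may span several)
    cited_doc_ids: set   # IDs the system localized under full-context prompting
    answer: str          # the system's end-to-end answer
    gold_answers: list   # acceptable reference answers

def evidence_access_recall(ex: Example) -> float:
    """Access: fraction of gold evidence documents the system located."""
    if not ex.gold_doc_ids:
        return 1.0
    return len(ex.gold_doc_ids & ex.cited_doc_ids) / len(ex.gold_doc_ids)

def answer_correct(ex: Example) -> bool:
    """Use: is the final answer right, scored independently of access?"""
    pred = ex.answer.strip().lower()
    return any(pred == g.strip().lower() for g in ex.gold_answers)

def decoupled_report(examples: list) -> dict:
    """Report the two axes separately rather than as one end-to-end score."""
    n = len(examples)
    return {
        "evidence_access": sum(evidence_access_recall(e) for e in examples) / n,
        "qa_accuracy": sum(answer_correct(e) for e in examples) / n,
    }
```

Keeping the two numbers apart is what lets the paper attribute the reality gap to evidence access: a system can fail by never locating the gold documents under semantic interference, or by locating them and still answering unfaithfully, and a single QA score cannot distinguish the two.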
Related papers
- LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering [29.294167109756042]
We present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Our approach integrates a backward verification mechanism to evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment).
arXiv Detail & Related papers (2026-01-21T14:52:03Z)
- A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification [2.0069888187253615]
Production LLM systems often rely on separate models for safety and other classification-heavy steps. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation.
arXiv Detail & Related papers (2026-01-19T18:40:29Z)
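As a loose illustration of the probe idea above (a sketch under assumptions, not the paper's code): a small head reads hidden states the serving LLM has already computed, so the label comes out of the same forward pass with no second model. The layer choice, pooling window, and shapes below are all assumptions.

```python
import torch
import torch.nn as nn

d_model, n_labels = 4096, 2  # assumed sizes; match your serving model

class TokenSelectiveProbe(nn.Module):
    """Linear probe over a pooled slice of one layer's hidden states."""
    def __init__(self, d_model: int, n_labels: int):
        super().__init__()
        self.head = nn.Linear(d_model, n_labels)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model), captured from an intermediate layer
        # during the same forward pass the LLM runs for generation.
        pooled = hidden[:, -8:, :].mean(dim=1)  # pool the last few token positions
        return self.head(pooled)

probe = TokenSelectiveProbe(d_model, n_labels)
hidden = torch.randn(4, 128, d_model)  # stand-in for captured hidden states
logits = probe(hidden)                 # (4, n_labels) with no extra LLM call
```

The appeal of the design is cost: the probe adds one linear layer, negligible next to the generation pass itself.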
- Short-Context Dominance: How Much Local Context Natural Language Actually Needs? [48.429870236229696]
We measure the minimum context length (MCL) needed to reproduce accurate full-context predictions. For sequences of 1-7k tokens drawn from long-context documents, we consistently find that 75-80% require at most the last 96 tokens. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next token.
arXiv Detail & Related papers (2025-12-08T22:25:00Z)
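A minimal sketch of how such a minimum-context-length measurement could be run, assuming a `predict_next_token` callable that wraps whatever model is under test; the paper's exact procedure and its DaMCL proxy are not reproduced here.

```python
def minimum_context_length(tokens, predict_next_token, ladder=(32, 64, 96, 128, 256, 512)):
    """Smallest suffix length whose next-token prediction matches the full context.

    `predict_next_token` is an assumed callable (e.g., a wrapper around an LM
    returning the argmax next token); it is not part of the paper's released code.
    """
    full_prediction = predict_next_token(tokens)
    for k in sorted(ladder):
        if k >= len(tokens):
            break
        if predict_next_token(tokens[-k:]) == full_prediction:
            return k  # shortest suffix that already reproduces the prediction
    return len(tokens)  # no short suffix suffices; fall back to full context
```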
- Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation [59.40886078302025]
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood. We present a lightweight black-box framework for explaining autoregressive token generation in MLLMs.
arXiv Detail & Related papers (2025-09-26T15:38:42Z)
- Can Multimodal Large Language Models Understand Spatial Relations? [16.76001474065412]
We introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO 2017. Results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%.
arXiv Detail & Related papers (2025-05-25T07:37:34Z)
- Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources. We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios of conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z)
- Collaborative Stance Detection via Small-Large Language Model Consistency Verification [11.512736305376654]
Stance detection on social media aims to identify attitudes expressed in tweets towards specific targets. Heavily relying on Large Language Models (LLMs) for stance detection is impractical for real-world social media monitoring systems. We propose Collaborative Stance Detection via Small-Large Language Model Consistency Verification.
arXiv Detail & Related papers (2025-02-27T10:30:50Z)
- NoLiMa: Long-Context Evaluation Beyond Literal Matching [100.00398424275501]
NoLiMa is a benchmark extending the needle-in-a-haystack (NIAH) test. It requires models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular large language models that claim to support contexts of at least 128K tokens.
arXiv Detail & Related papers (2025-02-07T18:49:46Z)
- Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting AI-generated text from large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the LLM's last hidden states and token positions to weight a sum of features derived from next-token distribution metrics across the sequence. In-distribution, PAWN matches and even exceeds the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z)
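Read loosely, the mechanism suggests something like the following sketch: per-token features from the next-token distribution (here log-probability and entropy) are pooled with weights computed from hidden states and positions. Everything below, including the shapes and the two chosen features, is an assumption rather than PAWN's actual architecture.

```python
import torch
import torch.nn as nn

class PAWNSketch(nn.Module):
    """Loose sketch of a perplexity-attention-weighted detector (assumed design)."""
    def __init__(self, d_model: int, n_feats: int = 2, n_classes: int = 2):
        super().__init__()
        self.scorer = nn.Linear(d_model + 1, 1)  # hidden state + position -> score
        self.clf = nn.Linear(n_feats, n_classes)

    def forward(self, hidden, logprob, entropy):
        # hidden: (B, T, d) last hidden states; logprob, entropy: (B, T) metrics
        B, T, _ = hidden.shape
        pos = torch.linspace(0, 1, T, device=hidden.device).expand(B, T).unsqueeze(-1)
        scores = self.scorer(torch.cat([hidden, pos], dim=-1)).squeeze(-1)  # (B, T)
        w = torch.softmax(scores, dim=1)                 # attention over tokens
        feats = torch.stack([logprob, entropy], dim=-1)  # (B, T, 2)
        pooled = (w.unsqueeze(-1) * feats).sum(dim=1)    # weighted feature sum
        return self.clf(pooled)                          # human-vs-AI logits

# Toy usage with random stand-ins for real model outputs:
m = PAWNSketch(d_model=768)
logits = m(torch.randn(2, 50, 768), torch.randn(2, 50), torch.rand(2, 50))
```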
- See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this false-premise task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z)