NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?
- URL: http://arxiv.org/abs/2407.11963v2
- Date: Fri, 09 May 2025 09:23:22 GMT
- Title: NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?
- Authors: Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, Kai Chen,
- Abstract summary: NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like Deepseek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
- Score: 43.98513461616172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The capability of large language models to handle long-context information is crucial across various real-world applications. Existing evaluation methods often either rely on real-world long texts, making it difficult to exclude the influence of models' inherent knowledge, or introduce irrelevant filler content to artificially reach target lengths, reducing assessment effectiveness. To address these limitations, we introduce NeedleBench, a synthetic framework for assessing retrieval and reasoning performance in bilingual long-context tasks with adaptive context lengths. NeedleBench systematically embeds key data points at varying depths to rigorously test model capabilities. Tasks are categorized into two scenarios: information-sparse, featuring minimal relevant details within extensive irrelevant text to simulate simple retrieval tasks; and information-dense (the Ancestral Trace Challenge), where relevant information is continuously distributed throughout the context to simulate complex reasoning tasks. Our experiments reveal that although recent reasoning models like Deepseek-R1 and OpenAI's o3 excel in mathematical reasoning, they struggle with continuous retrieval and reasoning in information-dense scenarios, even at shorter context lengths. We also characterize a phenomenon termed 'under-thinking', where models prematurely conclude reasoning despite available information. NeedleBench thus provides critical insights and targeted tools essential for evaluating and improving LLMs' long-context capabilities. All resources are available at OpenCompass: https://github.com/open-compass/opencompass.
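As the abstract describes, NeedleBench systematically embeds key data points ("needles") at controlled depths within long distractor text before querying the model. A minimal Python sketch of that placement step is given below; the helper name, filler text, and passcode needle are illustrative assumptions and do not reproduce the OpenCompass implementation.

```python
def embed_needle(haystack_sents, needle, depth):
    """Insert the needle sentence at a fractional depth (0.0 = start, 1.0 = end) of the haystack."""
    pos = int(len(haystack_sents) * depth)
    return " ".join(haystack_sents[:pos] + [needle] + haystack_sents[pos:])

# Information-sparse case: one key fact buried in irrelevant filler text.
haystack = ["This is an irrelevant filler sentence."] * 2000   # distractor context
needle = "The hidden passcode is 7345."                        # hypothetical key data point
prompt = embed_needle(haystack, needle, depth=0.75)            # place the needle at 75% depth
prompt += "\nQuestion: What is the hidden passcode?"
```

The information-dense Ancestral Trace Challenge would instead scatter many interdependent facts throughout the context, so the model must retrieve and chain all of them rather than locate a single sentence.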
Related papers
- From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? [34.959850282872594]
We present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families: detective cases, situation puzzles, and guessing numbers. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning.
arXiv Detail & Related papers (2025-06-09T23:56:41Z) - A Controllable Examination for Long-Context Language Models [45.47345679278309]
This study introduces LongBioBench, a novel benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning. LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams [4.917265821383127]
We construct the first spoken long-text dataset, derived from live streams, to reflect the redundancy-rich and conversational nature of real-world scenarios. We evaluate both popular LLMs and specialized methods to assess their ability to understand long contexts in these tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding.
arXiv Detail & Related papers (2025-04-24T08:27:48Z) - On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data [1.2979906794584584]
The applicability of Large Language Models (LLMs) to temporal reasoning tasks over data not seen during training remains largely unexplored. In this paper we address this topic, focusing on structured and semi-structured anonymized data. We identify and examine seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components.
arXiv Detail & Related papers (2025-04-10T10:48:42Z) - Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning [103.65680870130839]
We investigate how to design instruction data for the post-training phase of a long-context pre-trained model. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones. Based on these findings, we propose context synthesis, a novel data synthesis framework.
arXiv Detail & Related papers (2025-02-21T17:02:40Z) - Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs [23.960451986662996]
This paper proposes a method that emulates Retrieval Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought reasoning.
We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text.
arXiv Detail & Related papers (2025-02-18T02:49:40Z) - Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation [10.500629810624769]
We study the evaluation of long-context language models through many-shot in-context learning (ICL).
We identify the skills each ICL task requires, and examine models' long-context capabilities on them.
We introduce a new many-shot ICL benchmark, MANYICLBENCH, designed to characterize LCLMs' retrieval and global context understanding capabilities separately.
arXiv Detail & Related papers (2024-11-11T17:00:59Z) - Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We find that models trained on synthetic data fall short of those trained on real data, but, surprisingly, the mismatch can be interpreted. Our results shed light on how to interpret synthetic-data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models [73.13933847198395]
We propose a training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding.
The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output; a minimal sketch of this split-then-aggregate pattern appears after this list.
arXiv Detail & Related papers (2024-10-12T03:13:44Z) - ACER: Automatic Language Model Context Extension via Retrieval [36.40066695682234]
Current open-weight generalist long-context models still fall short on practical long-context processing tasks.
We build an automatic data synthesis pipeline that mimics this process using short-context LMs.
The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities.
arXiv Detail & Related papers (2024-10-11T17:57:06Z) - DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotate 600 questions in Chinese and also provide an English edition of the context information and the questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z) - LongIns: A Challenging Long-context Instruction-based Exam for LLMs [44.51209510772957]
Long-context capabilities of large language models (LLMs) have been a hot topic in recent years.
We propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs.
arXiv Detail & Related papers (2024-06-25T14:31:26Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z) - XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various efforts have been proposed to expand the context window to accommodate up to 200K input tokens.
We introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench.
arXiv Detail & Related papers (2024-04-08T12:29:07Z) - NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
We introduce NovelQA, a benchmark tailored for evaluating Large Language Models (LLMs) with complex, extended narratives.
NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding.
Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses.
arXiv Detail & Related papers (2024-03-18T17:32:32Z) - LooGLE: Can Long-Context Language Models Understand Long Contexts? [46.143956498529796]
LooGLE is a benchmark for large language models' long context understanding.
It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains.
The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved remarkable proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z) - LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)