Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
- URL: http://arxiv.org/abs/2407.00402v3
- Date: Sun, 06 Oct 2024 09:09:26 GMT
- Title: Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
- Authors: Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty
- Abstract summary: We argue that conflating different tasks by their context length is unproductive.
We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts.
We conclude that the most difficult and interesting settings, in which the necessary information is extensive and highly diffused within the input, are severely under-explored.
- Score: 32.19010113355365
- Abstract: Improvements in language models' capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of "long-context", defined simply by the total length of the model's input, including, for example, Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, in which the necessary information is extensive and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can conduct more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter contexts.
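To make the two axes concrete, the sketch below encodes Diffusion and Scope as coarse ordinal scores and places the three example tasks named in the abstract on the resulting grid. This is a minimal illustrative sketch: the scoring scheme and the specific task placements are assumptions made here for exposition, not annotations from the paper.

```python
from dataclasses import dataclass


@dataclass
class LongContextTask:
    """Coarse placement of a task on the paper's two difficulty axes."""
    name: str
    diffusion: int  # how hard the necessary information is to find (0 = trivial, 2 = hard)
    scope: int      # how much necessary information there is to find (0 = little, 2 = a lot)

    @property
    def quadrant(self) -> str:
        d = "high-diffusion" if self.diffusion >= 1 else "low-diffusion"
        s = "high-scope" if self.scope >= 1 else "low-scope"
        return f"{d} / {s}"


# Illustrative placements (assumptions for the sake of the example, not the authors' annotations).
tasks = [
    LongContextTask("Needle-in-a-Haystack", diffusion=0, scope=0),      # one verbatim-retrievable snippet
    LongContextTask("Book summarization", diffusion=0, scope=2),        # much of the text matters, but it is easy to locate
    LongContextTask("Information aggregation", diffusion=2, scope=2),   # the genuinely hard, under-explored quadrant
]

for t in tasks:
    print(f"{t.name:25s} -> {t.quadrant}")
```

Under this framing, the abstract's claim is that the cell where both scores are high is the one the literature has explored least.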
Related papers
- ACER: Automatic Language Model Context Extension via Retrieval [36.40066695682234]
Current open-weight generalist long-context models still fall short on practical long-context processing tasks.
We build an automatic data synthesis pipeline that mimics this process using short-context LMs.
The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities.
arXiv Detail & Related papers (2024-10-11T17:57:06Z)
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotated 600 questions in Chinese and also provided an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z)
- NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? [37.64593022203498]
NeedleBench is a framework consisting of progressively more challenging tasks for assessing bilingual long-context capabilities.
We use the framework to assess how well the leading open-source models can identify key information relevant to the question.
We propose the Ancestral Trace Challenge to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions [14.999106867218572]
We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions.
We show that, counter-intuitively, a context that is more aligned with the topic does not always help more than a less relevant context.
arXiv Detail & Related papers (2024-07-02T07:52:30Z)
- Make Your LLM Fully Utilize the Context [70.89099306100155]
We show that FILM-7B can robustly retrieve information from different positions in its 32K context window.
FILM-7B significantly improves the performance on real-world long-context tasks.
arXiv Detail & Related papers (2024-04-25T17:55:14Z)
- NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts.
This paper presents the design and construction of NovelQA, highlighting its manual annotation, and diverse question types.
Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance.
arXiv Detail & Related papers (2024-03-18T17:32:32Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Lost in the Middle: How Language Models Use Long Contexts [88.78803442320246]
We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts.
We find that performance can degrade significantly when the position of relevant information changes (a minimal sketch of such a position sweep appears at the end of this list).
Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
arXiv Detail & Related papers (2023-07-06T17:54:11Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
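Several of the entries above (NeedleBench, Lost in the Middle) probe whether a model can locate a single relevant fact as its position in a long context varies. The sketch below shows one way such a position sweep could be set up; the `build_haystack` helper, the filler text, and the `query_model` callable are hypothetical placeholders introduced here, not the protocol of any of the papers listed.

```python
import random


def build_haystack(needle: str, filler_sentences: list[str],
                   total_sentences: int, needle_position: float) -> str:
    """Place a single 'needle' sentence at a relative position inside filler text.

    needle_position is in [0, 1]: 0.0 puts the needle at the start, 1.0 at the end.
    """
    body = [random.choice(filler_sentences) for _ in range(total_sentences)]
    insert_at = int(needle_position * total_sentences)
    body.insert(insert_at, needle)
    return " ".join(body)


def position_sweep(query_model, question: str, needle: str,
                   filler_sentences: list[str], length: int = 500) -> dict:
    """Check a hypothetical query_model(context, question) -> str at several needle positions."""
    results = {}
    for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
        context = build_haystack(needle, filler_sentences, length, pos)
        answer = query_model(context, question)
        # Crude containment check as a stand-in for real answer grading.
        results[pos] = needle.lower() in answer.lower()
    return results
```

A real evaluation would grade answers properly rather than checking substring containment, and would average over many needles and haystacks, but the loop structure is the same: hold the question fixed and vary only where the relevant information sits in the context.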