Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
- URL: http://arxiv.org/abs/2407.00402v3
- Date: Sun, 06 Oct 2024 09:09:26 GMT
- Title: Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
- Authors: Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty
- Abstract summary: We argue that conflating different tasks by their context length is unproductive.
We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts.
We conclude that the most difficult and interesting settings, in which the necessary information is extensive and highly diffused within the input, are severely under-explored.
- Score: 32.19010113355365
- Abstract: Improvements in language models' capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of "long-context", defined simply by the total length of the model's input, including, for example, Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, in which the necessary information is extensive and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can conduct more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter contexts.
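To make the two axes concrete, the sketch below encodes Diffusion and Scope as coarse ordinal scores and places the three example tasks named in the abstract on the resulting grid. This is a minimal illustrative sketch: the scoring scheme and the specific task placements are assumptions made here for exposition, not annotations from the paper.

```python
from dataclasses import dataclass


@dataclass
class LongContextTask:
    """Coarse placement of a task on the paper's two difficulty axes."""
    name: str
    diffusion: int  # how hard the necessary information is to find (0 = trivial, 2 = hard)
    scope: int      # how much necessary information there is to find (0 = little, 2 = a lot)

    @property
    def quadrant(self) -> str:
        d = "high-diffusion" if self.diffusion >= 1 else "low-diffusion"
        s = "high-scope" if self.scope >= 1 else "low-scope"
        return f"{d} / {s}"


# Illustrative placements (assumptions for the sake of the example, not the authors' annotations).
tasks = [
    LongContextTask("Needle-in-a-Haystack", diffusion=0, scope=0),      # one verbatim-retrievable snippet
    LongContextTask("Book summarization", diffusion=0, scope=2),        # much of the text matters, but it is easy to locate
    LongContextTask("Information aggregation", diffusion=2, scope=2),   # the genuinely hard, under-explored quadrant
]

for t in tasks:
    print(f"{t.name:25s} -> {t.quadrant}")
```

Under this framing, the abstract's claim is that the cell where both scores are high is the one the literature has explored least.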
Related papers
- ACER: Automatic Language Model Context Extension via Retrieval [36.40066695682234]
Current open-weight generalist long-context models still fall short on practical long-context processing tasks.
We build an automatic data synthesis pipeline that mimics this process using short-context LMs.
The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities.
arXiv Detail & Related papers (2024-10-11T17:57:06Z)
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotated 600 questions in Chinese and also provided an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z)
- NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? [37.64593022203498]
NeedleBench is a framework consisting of progressively more challenging tasks for assessing bilingual long-context capabilities.
We use the framework to assess how well the leading open-source models can identify key information relevant to the question.
We propose the Ancestral Trace Challenge to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions [14.999106867218572]
We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions.
We show that, counter-intuitively, a context that is more aligned with the topic does not always help more than a less relevant context.
arXiv Detail & Related papers (2024-07-02T07:52:30Z)
- Make Your LLM Fully Utilize the Context [70.89099306100155]
We show that FILM-7B can robustly retrieve information from different positions in its 32K context window.
FILM-7B significantly improves the performance on real-world long-context tasks.
arXiv Detail & Related papers (2024-04-25T17:55:14Z)
- NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts.
This paper presents the design and construction of NovelQA, highlighting its manual annotation, and diverse question types.
Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance.
arXiv Detail & Related papers (2024-03-18T17:32:32Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Lost in the Middle: How Language Models Use Long Contexts [88.78803442320246]
We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts.
We find that performance can degrade significantly when the position of relevant information changes (a minimal sketch of such a position sweep appears at the end of this list).
Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
arXiv Detail & Related papers (2023-07-06T17:54:11Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
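Several of the entries above (NeedleBench, Lost in the Middle) probe whether a model can locate a single relevant fact as its position in a long context varies. The sketch below shows one way such a position sweep could be set up; the `build_haystack` helper, the filler text, and the `query_model` callable are hypothetical placeholders introduced here, not the protocol of any of the papers listed.

```python
import random


def build_haystack(needle: str, filler_sentences: list[str],
                   total_sentences: int, needle_position: float) -> str:
    """Place a single 'needle' sentence at a relative position inside filler text.

    needle_position is in [0, 1]: 0.0 puts the needle at the start, 1.0 at the end.
    """
    body = [random.choice(filler_sentences) for _ in range(total_sentences)]
    insert_at = int(needle_position * total_sentences)
    body.insert(insert_at, needle)
    return " ".join(body)


def position_sweep(query_model, question: str, needle: str,
                   filler_sentences: list[str], length: int = 500) -> dict:
    """Check a hypothetical query_model(context, question) -> str at several needle positions."""
    results = {}
    for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
        context = build_haystack(needle, filler_sentences, length, pos)
        answer = query_model(context, question)
        # Crude containment check as a stand-in for real answer grading.
        results[pos] = needle.lower() in answer.lower()
    return results
```

A real evaluation would grade answers properly rather than checking substring containment, and would average over many needles and haystacks, but the loop structure is the same: hold the question fixed and vary only where the relevant information sits in the context.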