Make Your LLM Fully Utilize the Context
- URL: http://arxiv.org/abs/2404.16811v2
- Date: Fri, 26 Apr 2024 11:15:21 GMT
- Title: Make Your LLM Fully Utilize the Context
- Authors: Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou,
- Abstract summary: We show that FILM-7B can robustly retrieve information from different positions in its 32K context window.
FILM-7B significantly improves the performance on real-world long-context tasks.
- Score: 70.89099306100155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: https://github.com/microsoft/FILM.
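The abstract describes IN2 training only at a high level, so below is a minimal sketch of how such information-intensive examples could be synthesized. The names (`IN2Example`, `synthesize_example`), the slot-based placement, and the filler-segment source are illustrative assumptions, not the pipeline released in the microsoft/FILM repository.

```python
import random
from dataclasses import dataclass


@dataclass
class IN2Example:
    context: str          # synthesized long context (target: 4K-32K tokens)
    question: str         # asks about the embedded key segment(s)
    answer: str           # derivable only from the key segment(s)
    key_positions: list   # approximate token offsets of the key segments


def synthesize_example(key_segments, question, answer, filler_segments,
                       target_tokens=32_000, seg_tokens=128, rng=random):
    """Scatter ~seg_tokens-token key segments at random slots inside filler text.

    With one key segment the example trains fine-grained information awareness;
    with two or more, the answer requires integrating and reasoning over
    information spread across the long context.
    """
    n_slots = max(len(key_segments), target_tokens // seg_tokens)
    key_slots = sorted(rng.sample(range(n_slots), k=len(key_segments)))
    segments, key_positions = [], []
    for slot in range(n_slots):
        if slot in key_slots:
            segments.append(key_segments[key_slots.index(slot)])
            key_positions.append(slot * seg_tokens)
        else:
            segments.append(rng.choice(filler_segments))
    return IN2Example(" ".join(segments), question, answer, key_positions)
```

Because the key positions are recorded, the same generator can also drive position-wise probing in the spirit of the paper's three probing tasks: sweep a key segment across the context, or fix pairs of segments for forward, backward, and bi-directional retrieval patterns, and measure retrieval accuracy at each offset.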
Related papers
- ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [51.587657076291]
ChatQA 2 is a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models.
We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens.
We find that the state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG.
arXiv Detail & Related papers (2024-07-19T17:35:47Z) - NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? [37.64593022203498]
NeedleBench is a framework consisting of progressively more challenging tasks for assessing bilingual long-context capabilities.
We use the framework to assess how well the leading open-source models can identify key information relevant to the question.
We propose the Ancestral Trace Challenge to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks.
arXiv Detail & Related papers (2024-07-16T17:59:06Z) - Attention Instruction: Amplifying Attention in the Middle via Prompting [35.07098912195063]
Language models still suffer from position bias and have difficulty in accessing and using the middle part of the context.
We examine the relative position awareness of LLMs and the feasibility of mitigating disproportional attention through prompting.
arXiv Detail & Related papers (2024-06-24T19:35:11Z) - Learning to Reduce: Optimal Representations of Structured Data in
Prompting Large Language Models [42.16047343029512]
Large Language Models (LLMs) have been widely used as general-purpose AI agents.
We propose a framework, Learning to Reduce, that fine-tunes a language model to generate a reduced version of an input context.
We show that our model achieves comparable accuracies in selecting the relevant evidence from an input context.
arXiv Detail & Related papers (2024-02-22T00:41:23Z) - Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts When Knowledge Conflicts? [45.233517779029334]
We identify whether responses are attributed to generated or retrieved contexts.
Experiments reveal a significant bias in several LLMs to favor generated contexts, even when they provide incorrect information.
arXiv Detail & Related papers (2024-01-22T12:54:04Z) - Look Before You Leap: A Universal Emergent Decomposition of Retrieval
Tasks in Language Models [58.57279229066477]
We study how language models (LMs) solve retrieval tasks in diverse situations.
We introduce ORION, a collection of structured retrieval tasks spanning six domains.
We find that LMs internally decompose retrieval tasks in a modular way.
arXiv Detail & Related papers (2023-12-13T18:36:43Z) - LooGLE: Can Long-Context Language Models Understand Long Contexts? [50.408957515411096]
LooGLE is a benchmark for large language models' long context understanding.
It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains.
The evaluation of eight state-of-the-art LLMs on LooGLE shows that they handle short-dependency tasks reasonably well but still struggle with tasks requiring long-range dependencies.
arXiv Detail & Related papers (2023-11-08T01:45:37Z) - Retrieval meets Long Context Large Language Models [59.431200671427064]
Extending the context window of large language models (LLMs) has recently become popular.
Retrieval augmentation or a long context window: which is better for downstream tasks?
Can both methods be combined to get the best of both worlds?
Our best model, retrieval-augmented Llama2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context tasks.
arXiv Detail & Related papers (2023-10-04T17:59:41Z) - LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.