Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration
- URL: http://arxiv.org/abs/2502.20405v1
- Date: Sat, 01 Feb 2025 21:47:15 GMT
- Title: Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration
- Authors: James Begin, Namit Agrawal, Eshan Singh, Yicheng Fu, Sean O'Brien, Vasu Sharma, Kevin Zhu
- Abstract summary: We introduce pause-tuning, a technique that redistributes attention to enhance comprehension of long-context inputs. Our approach involves fine-tuning language models on datasets with artificially inserted pause tokens. We evaluate pause-tuning against alternative approaches using the Needle-in-a-Haystack benchmark.
- Score: 4.7429246847107835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs have demonstrated remarkable proficiency in understanding tasks but continue to struggle with long-context comprehension, particularly with content located in the middle of extensive inputs. This limitation, known as the Lost-in-the-Middle (LITM) problem, hinders models from fully processing and utilizing information across lengthy contexts. To address this issue, we introduce pause-tuning, a technique that redistributes attention to enhance comprehension of long-context inputs. Our approach involves fine-tuning language models on datasets with artificially inserted pause tokens, which serve to segment the input into smaller, more manageable parts. We evaluate pause-tuning against alternative approaches using the Needle-in-a-Haystack benchmark, where models must retrieve information embedded within contexts of up to 128K tokens. Experimental results demonstrate significant performance gains, with the LLaMA 3.2 3B Instruct model and the LLaMA 3.1 8B Instruct model improving by 10.61% and 3.57% respectively on average, suggesting that pause-tuning successfully enhances attention redistribution and improves long-context retention. The code and data are available at https://anonymous.4open.science/r/LITM-PauseTokens-7357.
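To make the approach concrete, below is a minimal sketch of the data-preparation step: a long document is split into fixed-size segments and a pause token is inserted between them before fine-tuning. The token string `<|pause|>` and the segment length are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of pause-token insertion: segment a long document and join
# the segments with an artificial pause token. Token string and segment size
# are assumptions for illustration only.

PAUSE_TOKEN = "<|pause|>"

def insert_pause_tokens(text: str, words_per_segment: int = 256) -> str:
    """Split `text` into word segments and join them with a pause token."""
    words = text.split()
    segments = [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]
    return f" {PAUSE_TOKEN} ".join(segments)

if __name__ == "__main__":
    long_context = ("word " * 1000).strip()
    augmented = insert_pause_tokens(long_context)
    print(augmented.count(PAUSE_TOKEN), "pause tokens inserted")
```

In practice the pause token would also be registered as a special token with the tokenizer (and the embedding matrix resized) before fine-tuning on the augmented data; the exact training setup is described in the paper's repository linked above.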
Related papers
- END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences, which degrades output quality. We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without fine-tuning the LLMs.
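A hedged sketch of the general idea: score each context chunk against the query and keep only the most relevant chunks before the full model sees the input. The lexical-overlap scorer and keep ratio below are stand-in assumptions, not the paper's actual dropping criterion.

```python
# Illustrative chunk filtering: drop low-relevance chunks early, keep the rest
# in their original order. The scoring function is a crude stand-in.

def chunk_relevance(chunk: str, query: str) -> float:
    """Crude lexical-overlap score between a chunk and the query."""
    chunk_words = set(chunk.lower().split())
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(chunk_words & query_words) / len(query_words)

def drop_noisy_chunks(chunks: list[str], query: str, keep_ratio: float = 0.5) -> list[str]:
    """Keep the top `keep_ratio` fraction of chunks by relevance, preserving order."""
    ranked = sorted(range(len(chunks)),
                    key=lambda i: chunk_relevance(chunks[i], query),
                    reverse=True)
    keep = set(ranked[: max(1, int(len(chunks) * keep_ratio))])
    return [c for i, c in enumerate(chunks) if i in keep]
```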
arXiv Detail & Related papers (2025-02-26T08:07:17Z) - NoLiMa: Long-Context Evaluation Beyond Literal Matching [100.00398424275501]
Recent large language models (LLMs) support contexts ranging from 128K to 1M tokens. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts, performance degrades significantly as context length increases.
arXiv Detail & Related papers (2025-02-07T18:49:46Z) - Knowing When to Stop: Dynamic Context Cutoff for Large Language Models [5.800837821046764]
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient in cases where the information required to answer a query is localized within the context.
We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information.
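A minimal sketch of the idea under stated assumptions: read the context chunk by chunk and stop as soon as a sufficiency probe fires. `has_sufficient_information` and `answer_with` are hypothetical callables standing in for the paper's internal signal and the underlying model.

```python
# Illustrative self-terminating reading loop: accumulate chunks until a
# sufficiency probe says enough task-relevant information has been seen.

from typing import Callable

def answer_with_cutoff(
    chunks: list[str],
    query: str,
    has_sufficient_information: Callable[[str, str], bool],  # hypothetical probe
    answer_with: Callable[[str, str], str],                   # hypothetical model call
) -> str:
    seen = ""
    for chunk in chunks:
        seen += chunk
        if has_sufficient_information(seen, query):
            break  # stop reading: remaining context is not needed
    return answer_with(seen, query)
```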
arXiv Detail & Related papers (2025-02-03T03:38:29Z) - Reducing Distraction in Long-Context Language Models by Focused Learning [6.803882766744194]
We propose a novel training method that enhances Large Language Models' ability to discern relevant information.
During fine-tuning with long contexts, we employ a retriever to extract the most relevant segments.
We then introduce an auxiliary contrastive learning objective to explicitly ensure that outputs from the original context and the retrieved sub-context are closely aligned.
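A hedged PyTorch sketch of such an auxiliary objective: pull together the pooled output representations produced from the full context and from the retrieved sub-context. The cosine-distance form and the loss weight are illustrative assumptions, not necessarily the paper's exact contrastive loss.

```python
# Illustrative alignment term between full-context and sub-context outputs.

import torch
import torch.nn.functional as F

def alignment_loss(full_ctx_hidden: torch.Tensor, sub_ctx_hidden: torch.Tensor) -> torch.Tensor:
    """Both tensors have shape (batch, hidden): pooled output representations."""
    return (1.0 - F.cosine_similarity(full_ctx_hidden, sub_ctx_hidden, dim=-1)).mean()

# Combined training objective (the 0.1 weighting is an assumption):
# total_loss = lm_loss_full_context + 0.1 * alignment_loss(h_full, h_sub)
```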
arXiv Detail & Related papers (2024-11-08T19:27:42Z) - What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. Perplexity (PPL) has proven unreliable for assessing long-context capabilities. We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
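A hedged sketch of a key-token perplexity in this spirit: a token counts as "key" when its loss improves markedly once the long context is available, compared with a short truncated context, and perplexity is averaged only over those tokens. The threshold value is an illustrative assumption.

```python
# Illustrative key-token perplexity: select tokens whose loss drops sharply
# when the long context is present, then exponentiate their mean loss.

import math

def key_token_perplexity(
    long_ctx_losses: list[float],   # per-token NLL with the full long context
    short_ctx_losses: list[float],  # per-token NLL with only a short context
    threshold: float = 2.0,         # assumed gap for a token to count as "key"
) -> float:
    key_losses = [
        l_long
        for l_long, l_short in zip(long_ctx_losses, short_ctx_losses)
        if (l_short - l_long) > threshold  # long context clearly helps here
    ]
    if not key_losses:
        return float("nan")
    return math.exp(sum(key_losses) / len(key_losses))
```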
arXiv Detail & Related papers (2024-10-31T09:39:28Z) - Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models [21.90388980448712]
Training models to handle long contexts presents significant challenges.
We introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase.
We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length.
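A minimal sketch of a chunk-shuffling augmentation in this spirit: documents are cut into chunks and the chunks are shuffled into one long training sequence, so the model must relate distant but related pieces. The chunk size, source tags, and shuffling scheme are assumptions rather than the paper's exact recipe.

```python
# Illustrative construction of a shuffled-chunk training sequence.

import random

def untie_the_knots_sequence(documents: list[str], chunk_chars: int = 512, seed: int = 0) -> str:
    rng = random.Random(seed)
    chunks = []
    for doc_id, doc in enumerate(documents):
        for start in range(0, len(doc), chunk_chars):
            # Tag each chunk with its source so the original structure is recoverable.
            chunks.append(f"[doc{doc_id}] " + doc[start:start + chunk_chars])
    rng.shuffle(chunks)
    return "\n".join(chunks)
```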
arXiv Detail & Related papers (2024-09-07T09:28:55Z) - FocusLLM: Precise Understanding of Long Context by Dynamic Condensing [16.642675785000176]
FocusLLM is a framework designed to extend the fixed context length of any decoder-only LLM. It employs a dynamic condensing process to distill crucial information from each chunk. Ultimately, through a novel parallel decoding mechanism, FocusLLM can integrate the extracted information into its local context.
arXiv Detail & Related papers (2024-08-21T16:11:59Z) - ReAttention: Training-Free Infinite Context with Finite Attention Scope [65.91272939057592]
ReAttention is a training-free approach to support infinite context with finite attention scope under sufficient memory resources.
We validate the performance of ReAttention on the LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods.
We also apply ReAttention to mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M tokens, and even expand the context length of LLaMA3.2-3B-chat by 128× to 4M without any further training, as measured on Needle-In-A-Haystack.
arXiv Detail & Related papers (2024-07-21T14:23:37Z) - Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding [78.36702055076456]
This paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle relevant information located in the middle of the context.
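A hedged sketch of head-wise position rescaling in this spirit: each attention head sees position indices divided by its own scaling ratio, so some heads attend with a compressed view of distance. The ratio range is an illustrative assumption.

```python
# Illustrative multi-scale position ids: one rescaled position sequence per head.

import torch

def multiscale_position_ids(
    seq_len: int,
    num_heads: int,
    min_ratio: float = 1.2,  # assumed smallest compression ratio
    max_ratio: float = 1.8,  # assumed largest compression ratio
) -> torch.Tensor:
    """Return per-head rescaled position ids of shape (num_heads, seq_len)."""
    base_positions = torch.arange(seq_len, dtype=torch.float32)   # (seq_len,)
    ratios = torch.linspace(min_ratio, max_ratio, num_heads)      # one ratio per head
    return base_positions.unsqueeze(0) / ratios.unsqueeze(1)      # (num_heads, seq_len)

# These rescaled ids would then feed into each head's rotary position embedding.
```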
arXiv Detail & Related papers (2024-03-05T04:58:37Z) - Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)