Overflow Prevention Enhances Long-Context Recurrent LLMs
- URL: http://arxiv.org/abs/2505.07793v1
- Date: Mon, 12 May 2025 17:45:05 GMT
- Title: Overflow Prevention Enhances Long-Context Recurrent LLMs
- Authors: Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, James Glass, Leonid Karlinsky, Raja Giryes,
- Abstract summary: A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency.<n>We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance.<n>Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized.
- Score: 41.8230324612529
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.
Related papers
- Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers [58.98923344096319]
REFORM is a novel inference framework that efficiently handles long contexts through a two-phase approach.<n>It achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length.<n>It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains.
arXiv Detail & Related papers (2025-06-01T23:49:14Z) - ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
ParallelComp is a training-free method for long-context extrapolation.<n>It extends context length from 4K to 128K while maintaining high throughput and preserving perplexity.<n>Our analysis offers new insights into attention biases in parallel attention mechanisms.
arXiv Detail & Related papers (2025-02-20T07:10:43Z) - Does RAG Really Perform Bad For Long-Context Processing? [15.889864680212147]
RetroLM is a novel framework for long-context processing.<n>Unlike traditional methods, RetroLM employs KV-level retrieval augmentation.<n>Building on this framework, we develop a specialized retriever for precise retrieval of critical pages.
arXiv Detail & Related papers (2025-02-17T05:02:25Z) - LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs [10.84210988032097]
We introduce Long-form Context Injection with Recurrent Compression (LCIRC), a method that enables the efficient processing long-form sequences beyond the model's length limit.<n>We also introduce query dependent context modeling, which selectively compresses query-relevant information, ensuring that the model retains the most pertinent content.
arXiv Detail & Related papers (2025-02-10T04:02:18Z) - Breaking the Context Bottleneck on Long Time Series Forecasting [6.36010639533526]
Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation.<n>Recent advancements have enhanced the efficiency of these models, but the challenge of effectively leveraging longer sequences persists.<n>We propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences.
arXiv Detail & Related papers (2024-12-21T10:29:34Z) - Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotray position embeddING (STRING)
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
arXiv Detail & Related papers (2024-10-24T13:51:50Z) - Rethinking Token Reduction for State Space Models [47.00760373683448]
We propose a tailored, unified post-training token reduction method for State Space Models (SSMs)
Our approach integrates token importance and similarity, thus taking advantage of both pruning and merging.
Our method improves the average accuracy by 5.7% to 13.1% on six benchmarks with Mamba-2 compared to existing methods.
arXiv Detail & Related papers (2024-10-16T00:06:13Z) - Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models [21.90388980448712]
Training models to handle long contexts presents significant challenges.
We introduce Untie the Knots (textbfUtK), a novel data augmentation strategy employed during the continue pre-training phase.
We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accurracy on RULER at 128K context length.
arXiv Detail & Related papers (2024-09-07T09:28:55Z) - ReMamba: Equip Mamba with Effective Long-Sequence Modeling [50.530839868893786]
We propose ReMamba, which enhances Mamba's ability to comprehend long contexts.<n>ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process.
arXiv Detail & Related papers (2024-08-28T02:47:27Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.