Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning
- URL: http://arxiv.org/abs/2602.08382v1
- Date: Mon, 09 Feb 2026 08:33:11 GMT
- Title: Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning
- Authors: Zhuoen Chen, Dongfang Li, Meishan Zhang, Baotian Hu, Min Zhang
- Abstract summary: We propose a framework for efficient long-context inference based on chunk-wise compression and selective memory recall. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. It achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
- Score: 47.87361916374891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
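The control flow described in the abstract (chunk, compress, gate, then reason iteratively with an evolving working memory) can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses a learned compressor, a gating classifier, and a reasoner trained with reinforcement learning, whereas everything below is a hypothetical stand-in. The names `compress`, `gate`, and `reason`, the keyword-set "memory", and the overlap-based gate are assumptions chosen only to make the data flow concrete.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list for this toy example only.
STOPWORDS = {"a", "the", "is", "in", "of", "which", "does"}


def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer (toy stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(tokens: list[str], chunk_size: int) -> list[list[str]]:
    """Segment the long input into fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


@dataclass
class MemoryBlock:
    index: int
    summary: frozenset  # toy stand-in for a learned compressed representation


def compress(index: int, chunk: list[str]) -> MemoryBlock:
    """Toy compressor: keep only the distinct non-stopword tokens of a chunk."""
    return MemoryBlock(index, frozenset(t for t in chunk if t not in STOPWORDS))


def gate(working_memory: set, block: MemoryBlock, threshold: int = 1) -> bool:
    """Toy gate: select a block if it shares enough terms with working memory."""
    return len(working_memory & block.summary) >= threshold


def reason(question: str, blocks: list[MemoryBlock], max_hops: int = 2):
    """Iteratively recall gated blocks and fold them into working memory."""
    working_memory = set(tokenize(question)) - STOPWORDS
    recalled: list[int] = []
    for _ in range(max_hops):
        # Gate against the working memory as it stands at the start of the hop.
        selected = [b for b in blocks
                    if b.index not in recalled and gate(working_memory, b)]
        if not selected:
            break
        for block in selected:
            recalled.append(block.index)
            working_memory |= block.summary  # working memory evolves per hop
    return recalled, working_memory


DOCUMENT = ("alice lives in paris bob drinks green tea "
            "paris is in france carol plays the violin")
QUESTION = "which country does alice live in"

chunks = split_into_chunks(tokenize(DOCUMENT), chunk_size=4)
blocks = [compress(i, chunk) for i, chunk in enumerate(chunks)]
recalled, working = reason(QUESTION, blocks, max_hops=2)
print(recalled)             # [0, 2]
print("france" in working)  # True
```

In this toy run the first hop recalls only the chunk mentioning "alice"; the term "paris" it contributes to working memory is what lets the second hop recall the chunk mentioning "france". The two unrelated chunks are never processed, which is the intuition behind selective recall over compressed memory rather than reading all raw tokens.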
Related papers
- Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking [28.492055407384495]
Long contexts increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. Existing methods typically compress the entire context indiscriminately into a set of memory tokens. We propose Parallelized Iterative Compression (PIC), which restricts the receptive field of memory tokens to sequential local chunks.
arXiv Detail & Related papers (2026-02-15T03:58:13Z)
- SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z)
- Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks [2.7708222692419735]
How to enable human-like long-term memory in large language models (LLMs) has been a central question. We present SUMER, an end-to-end reinforcement learning agent with verifiable reward (RLVR). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks.
arXiv Detail & Related papers (2025-11-20T22:45:57Z)
- UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression [86.33995240043936]
UniGist is a sequence-level long-context compression framework for large language models. It efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings.
arXiv Detail & Related papers (2025-09-19T08:47:37Z)
- Lag-Relative Sparse Attention In Long Context Training [8.365610885641276]
We propose Lag-Relative Sparse Attention (LRSA), anchored by the LagKV compression method, for long-context post-training. Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window.
arXiv Detail & Related papers (2025-06-13T06:49:53Z)
- Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers [58.98923344096319]
REFORM is a novel inference framework that efficiently handles long contexts through a two-phase approach. It achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains.
arXiv Detail & Related papers (2025-06-01T23:49:14Z)
- From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents [79.87304940020256]
Large Language Models (LLMs) have been widely adopted in conversational agents. MemGAS is a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks.
arXiv Detail & Related papers (2025-05-26T06:13:07Z)
- Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning [14.33163594016033]
Reasoning Path Compression (RPC) is a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. We show RPC improves generation throughput of QwQ-32B by up to 1.60$\times$ compared to the inference with full KV cache. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs.
arXiv Detail & Related papers (2025-05-20T03:21:52Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.05657299071648]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings. We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- Recurrent Context Compression: Efficiently Expanding the Context Window of LLM [22.595457889113668]
This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of Transformer-based large language models (LLMs). We validated our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M.
arXiv Detail & Related papers (2024-06-10T08:50:59Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models. HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.