LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
- URL: http://arxiv.org/abs/2506.15969v3
- Date: Wed, 15 Oct 2025 01:55:31 GMT
- Title: LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
- Authors: Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo,
- Abstract summary: extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache.<n>Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks.<n>We propose LazyEviction, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens' recurrence patterns.
- Score: 21.761205124793175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) exhibit enhanced capabilities by Chain-of-Thought reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens regain high attention after multiple decoding steps, which is failed to capture by existing works and may lead to unpredictable eviction on such periodically critical tokens. To address this, we propose LazyEviction, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache by 50%~70% while maintaining comparable accuracy, outperforming existing KV cache compression baselines. Our implementation code can be found at https://github.com/Halo-949/LazyEviction.
Related papers
- ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during longtext generations.<n>We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
arXiv Detail & Related papers (2026-02-03T07:16:51Z) - G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths.<n> KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning.<n>We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance.
arXiv Detail & Related papers (2025-11-29T14:21:33Z) - OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference [11.315090790312041]
We propose a principled framework that formulates cache eviction as a layer-wise structured pruning problem.<n>We measure the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs.<n>Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals.
arXiv Detail & Related papers (2025-10-09T00:58:28Z) - Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize key-value ( KV) cache to store historical information during sequence processing.<n>Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction.<n>We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference.<n>We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - CaliDrop: KV Cache Compression with Calibration [44.722738059962296]
Large Language Models (LLMs) require substantial computational resources during generation.<n> KV cache compression techniques have been proposed to mitigate this bottleneck.<n>This paper focuses on enhancing token eviction strategies.
arXiv Detail & Related papers (2025-07-26T10:34:53Z) - Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks.<n>LRMs need to generate long chain-of-thought reasoning in order to think before answering.<n>We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z) - Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query [48.52389201779425]
KV cache memory usage grows substantially with longer text sequences.<n>Existing KV cache eviction methods prune tokens using prefilling-stage attention scores.<n>Lookahead Q-Cache generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries.
arXiv Detail & Related papers (2025-05-24T10:34:38Z) - ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy [8.962703809086628]
ThinkLess is an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model.<n>We show that ThinkLess achieves comparable accuracy to full-length Chain-of-Thought (CoT) decoding while greatly reducing decoding time and memory consumption.
arXiv Detail & Related papers (2025-05-21T15:58:16Z) - Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation [80.69067017594709]
Large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks.<n>We propose a novel framework, log-augmented generation (LAG) that directly reuses prior computation and reasoning from past logs at test time.<n>Our method significantly outperforms standard agentic systems that do not utilize logs.
arXiv Detail & Related papers (2025-05-20T14:14:38Z) - ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [75.1101108949743]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting.<n>LRMs often suffer from verbose outputs caused by redundant content, increasing computational overhead, and degrading user experience.<n>We propose ConCISE, a framework that simplifies reasoning chains by reinforcing the model's confidence during inference.
arXiv Detail & Related papers (2025-05-08T01:40:40Z) - CAOTE: KV Caching through Attention Output Error based Token Eviction [6.1346213444758355]
Token eviction is a widely adopted post-training methodology designed to alleviate the bottlenecks by evicting less important tokens from the cache.<n>We propose a simple eviction criterion based on the contribution of cached tokens to attention outputs.<n>We show that CAOTE, when combined with the state-of-the-art attention score-based methods, always improves accuracies on the downstream task.
arXiv Detail & Related papers (2025-04-18T19:46:54Z) - AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, which is the first learning-based critical token identification approach.<n> AttentionPredictor accurately predicts the attention score while consuming negligible memory.<n>We also propose a cross-token critical cache prefetching framework that hides the token time overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z) - Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [24.118503938098307]
textscPoD allocates memory according to token importance.<n>textscPoD reduces KV cache memory usage by up to 35% without compromising performance.
arXiv Detail & Related papers (2024-12-03T08:29:27Z) - RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation.<n>Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z) - CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.