ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution
- URL: http://arxiv.org/abs/2602.03203v1
- Date: Tue, 03 Feb 2026 07:16:51 GMT
- Title: ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution
- Authors: Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao,
- Abstract summary: We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during longtext generations.<n>We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
- Score: 84.41751286055909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.
Related papers
- Learning to Evict from Key-Value Cache [17.365511268829703]
We introduce KV Policy, a framework for learning to rank tokens by their predicted usefulness for future decoding.<n> evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k.<n>Results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
arXiv Detail & Related papers (2026-02-10T19:34:15Z) - Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models.<n>Our approach integrates seamlessly into both the prefill and decoding stages.<n>Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z) - SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models [25.509962883211]
Large reasoning models (LRMs) often cost significant key-value (KV) cache overhead, due to their linear growth with the chain-of-thought (CoT) reasoning process.<n>We present textbfSkipKV, a KV compression method for selective textiteviction and textitgeneration operating at a coarse-grained sentence-level sequence removal.
arXiv Detail & Related papers (2025-12-08T19:32:06Z) - G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths.<n> KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning.<n>We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance.
arXiv Detail & Related papers (2025-11-29T14:21:33Z) - LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences [12.093166735658626]
Key-Value (KV) cache succeeds in reducing redundant computations in auto-regressive models.<n>It introduces significant memory overhead, limiting its practical deployment in long-sequence scenarios.<n>Existing KV retrieval methods suffer from notable efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level KV management.
arXiv Detail & Related papers (2025-10-13T11:28:30Z) - Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize key-value ( KV) cache to store historical information during sequence processing.<n>Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction.<n>We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z) - LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning [21.761205124793175]
extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache.<n>Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks.<n>We propose LazyEviction, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens' recurrence patterns.
arXiv Detail & Related papers (2025-06-19T02:25:04Z) - FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management [48.904743679691414]
FlowKV is a novel multi-turn isolation mechanism for KV Cache management.<n>It preserves the accumulated compressed KV cache from past turns.<n>It prevents the re-compression of older context and thereby mitigating catastrophic forgetting.
arXiv Detail & Related papers (2025-05-21T10:20:46Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference.<n>The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately.<n>In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [37.94892570127548]
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache.<n>Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime.<n>We propose Ada-KV, the first head-wise adaptive budget allocation strategy.
arXiv Detail & Related papers (2024-07-16T09:53:32Z) - No Token Left Behind: Reliable KV Cache Compression via Importance-Aware
Mixed Precision Quantization [31.806112535762367]
Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models(LLMs)
arXiv Detail & Related papers (2024-02-28T06:34:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.