Related papers: Learning to Evict from Key-Value Cache

URL: http://arxiv.org/abs/2602.10238v1
Date: Tue, 10 Feb 2026 19:34:15 GMT
Title: Learning to Evict from Key-Value Cache
Authors: Luca Moschella, Laura Manduchi, Ozan Sener
Abstract summary: We introduce KV Policy (KVP), a framework for learning to rank tokens by their predicted usefulness for future decoding. KVP is evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k. Results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
Score: 17.365511268829703
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
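The ranking view of eviction described in the abstract can be illustrated with a minimal sketch. It assumes each cached token already has a scalar utility score; the scores below are hand-written stand-ins for what a learned per-head agent might output. The function names and the tie-breaking rule are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def select_tokens_to_keep(scores: Sequence[float], budget: int) -> List[int]:
    """Return positions of the `budget` highest-scoring tokens, in sequence order.

    `scores` are assumed utility predictions (higher means more useful later).
    Ties are broken in favor of the more recent token (an arbitrary choice).
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if budget >= len(scores):
        return list(range(len(scores)))
    # Rank positions by predicted utility, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: (scores[i], i), reverse=True)
    # Keep the top `budget` positions and restore their original order.
    return sorted(ranked[:budget])


def evict(
    keys: Sequence, values: Sequence, scores: Sequence[float], budget: int
) -> Tuple[list, list]:
    """Drop every cached (key, value) pair that is not among the kept positions."""
    keep = select_tokens_to_keep(scores, budget)
    return [keys[i] for i in keep], [values[i] for i in keep]


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.7, 0.2]  # illustrative values only
    for budget in (2, 3):
        print(budget, select_tokens_to_keep(scores, budget))
```

Because a single ranking determines the kept set for every budget, the tokens kept under a smaller budget are always a subset of those kept under a larger one. This is one way to read the abstract's remark that the ranking is evaluated across all cache budgets.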

Related papers

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
arXiv Detail & Related papers (2026-02-03T07:16:51Z)
Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective [31.67506313325633]
KV caching is a technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing.
arXiv Detail & Related papers (2026-01-26T22:20:59Z)
Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs [26.951325519894525]
We propose a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. We show that it consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. It even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization.
arXiv Detail & Related papers (2025-12-03T00:20:35Z)
G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance.
arXiv Detail & Related papers (2025-11-29T14:21:33Z)
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z)
Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation [80.69067017594709]
Large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks. We propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time. Our method significantly outperforms standard agentic systems that do not utilize logs.
arXiv Detail & Related papers (2025-05-20T14:14:38Z)
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with other methods.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
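For readers unfamiliar with the Least Recently Used (LRU) baseline mentioned in one of the entries above, the following is a minimal sketch of a generic LRU cache. It is not taken from any of the listed papers; the class name, capacity, and cache keys are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Generic LRU cache: when full, evict the entry that was used longest ago."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Insertion order doubles as recency order: oldest first, newest last.
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None on a miss. A hit refreshes recency."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key: Hashable, value: Any) -> None:
        """Insert or update an entry, evicting the least recently used if over capacity."""
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the oldest entry


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("prompt-a", "kv-a")  # hypothetical keys standing in for cached prefixes
    cache.put("prompt-b", "kv-b")
    cache.get("prompt-a")          # refreshes "prompt-a"
    cache.put("prompt-c", "kv-c")  # evicts "prompt-b"
    print(cache.get("prompt-b"))
```

LRU uses recency alone as a proxy for future usefulness, which is the kind of heuristic the learned approaches in this list aim to improve on.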

This list is automatically generated from the titles and abstracts of the papers in this site.
