Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
- URL: http://arxiv.org/abs/2509.10798v1
- Date: Sat, 13 Sep 2025 03:34:12 GMT
- Title: Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
- Authors: Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
- Abstract summary: Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. We propose Judge Q, a novel training method which incorporates a soft token list.
- Score: 53.83828564664595
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing. The size of the KV cache grows linearly with the sequence length, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method which incorporates a soft token list. This method only tunes the model's embedding layer at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens' attention map to the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when the KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation compared to existing eviction approaches. We validate our approach through experiments conducted on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on LongBench and over 3 points on RULER. This proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
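To make the eviction step concrete, here is a minimal sketch of how trained soft-token queries could be used to score and prune the KV cache: the soft tokens attend over the cached keys, each cached position is scored by the attention mass it receives, and only the top-scoring entries (plus a recent local window) are kept. The tensor shapes, the averaging over soft tokens, the always-kept local window, and the `budget` parameter are assumptions for illustration, not the paper's implementation.

```python
import torch

def evict_kv_with_soft_token_queries(keys, values, soft_queries, budget, n_local=8):
    """Hedged sketch of attention-score-based KV eviction with soft-token queries.

    keys, values : [num_heads, seq_len, head_dim]   cached KV from pre-filling
    soft_queries : [num_heads, num_soft, head_dim]  queries from the trained soft tokens
    budget       : number of KV entries to keep per head (assumed parameter)
    n_local      : most recent tokens that are always kept (assumed parameter)
    """
    head_dim = keys.shape[-1]
    # Attention of soft-token queries over all cached keys: [heads, num_soft, seq_len]
    attn = torch.softmax(
        soft_queries @ keys.transpose(-1, -2) / head_dim ** 0.5, dim=-1
    )
    # Importance of each cached position = attention mass received from the soft tokens,
    # averaged over the soft-token queries: [heads, seq_len]
    scores = attn.mean(dim=1)
    # Always retain the local window at the end of the sequence
    scores[:, -n_local:] = float("inf")
    keep = scores.topk(budget, dim=-1).indices.sort(dim=-1).values  # [heads, budget]
    keep_exp = keep.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, keep_exp), values.gather(1, keep_exp)
```

Under this reading, the training described in the abstract only changes how `soft_queries` are obtained; the eviction itself remains a simple per-head top-k over attention scores.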
Related papers
- Learning to Evict from Key-Value Cache [17.365511268829703]
We introduce KV Policy, a framework for learning to rank tokens by their predicted usefulness for future decoding. The framework is evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k. Results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
arXiv Detail & Related papers (2026-02-10T19:34:15Z) - ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
arXiv Detail & Related papers (2026-02-03T07:16:51Z) - Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs [26.951325519894525]
We propose a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. We show that it consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. It even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization.
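As a rough illustration of the retention-gate idea in this entry, the sketch below attaches a tiny learned scorer to each token's key at creation time and later keeps only the highest-scoring entries. The single linear gate, the shapes, and the one-shot eviction call are assumptions; the paper's actual gate architecture and training objective are not reproduced here.

```python
import torch
import torch.nn as nn

class RetentionGate(nn.Module):
    """Illustrative per-token retention gate (assumed design: one linear layer)."""

    def __init__(self, head_dim: int):
        super().__init__()
        self.proj = nn.Linear(head_dim, 1)

    def forward(self, key_states: torch.Tensor) -> torch.Tensor:
        # key_states: [seq_len, head_dim] -> retention score in (0, 1) per token
        return torch.sigmoid(self.proj(key_states)).squeeze(-1)

def evict_by_retention(keys, values, gate, budget):
    # Score each token (here once, for simplicity) and keep the top `budget` entries.
    scores = gate(keys)                               # [seq_len]
    keep = scores.topk(budget).indices.sort().values  # preserve original token order
    return keys[keep], values[keep]
```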
arXiv Detail & Related papers (2025-12-03T00:20:35Z) - G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance.
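A hedged sketch of the global-scoring idea described above: keep a running total of the attention each cached token has received (historical signal), combine it with attention from a recent local window, and evict the lowest-scoring tokens. The normalization and the mixing weight `alpha` are assumptions, not G-KV's exact formulation.

```python
import torch

def global_eviction_scores(historical_attn_sum, local_attn, alpha=0.5):
    """Combine accumulated (historical) and recent (local) attention into one score.

    historical_attn_sum : [seq_len] attention mass each cached token has received so far
    local_attn          : [window, seq_len] attention from the most recent queries
    alpha               : assumed mixing weight between the two signals
    """
    hist = historical_attn_sum / (historical_attn_sum.sum() + 1e-6)  # normalize to a distribution
    local = local_attn.mean(dim=0)                                   # average over the local window
    local = local / (local.sum() + 1e-6)
    return alpha * hist + (1 - alpha) * local                        # higher = more worth keeping
```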
arXiv Detail & Related papers (2025-11-29T14:21:33Z) - OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference [11.315090790312041]
We propose a principled framework that formulates cache eviction as a layer-wise structured pruning problem. We measure the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals.
arXiv Detail & Related papers (2025-10-09T00:58:28Z) - CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction [6.1346213444758355]
Token eviction is a widely adopted post-training methodology designed to alleviate memory bottlenecks by evicting less important tokens from the cache. We propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. We show that CAOTE, when combined with state-of-the-art attention-score-based methods, consistently improves accuracy on downstream tasks.
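One way to approximate an attention-output-based criterion: with attention weights a and values v giving output o = sum_i a_i v_i, removing token i and renormalizing shifts the output by a_i / (1 - a_i) * ||v_i - o||, which is roughly a_i * ||v_i - o|| for small weights, so that quantity can serve as an eviction score. The sketch below uses this assumed approximation and is not CAOTE's published derivation.

```python
import torch

def output_error_scores(attn_weights, values):
    """Assumed approximation of each token's contribution to the attention output.

    attn_weights : [seq_len]            softmax attention weights for one query/head
    values       : [seq_len, head_dim]  cached value vectors
    Lower score = removing the token perturbs the output less = safer to evict.
    """
    output = attn_weights @ values                         # o = sum_i a_i v_i, shape [head_dim]
    return attn_weights * (values - output).norm(dim=-1)   # [seq_len]
```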
arXiv Detail & Related papers (2025-04-18T19:46:54Z) - LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference [16.83202690345235]
We propose Self-Attention Guided Eviction (SAGE-KV), a simple and effective KV cache eviction method for long-context inference. After prefilling, our method performs a one-time top-k selection at both the token and head levels to compress the KV cache. SAGE-KV achieves 4x higher memory efficiency with improved accuracy over the static KV cache selection method StreamLLM, and 2x higher memory efficiency with better accuracy than the dynamic KV cache selection method Quest.
arXiv Detail & Related papers (2025-03-11T20:45:02Z) - In-context KV-Cache Eviction for LLMs via Attention-Gate [12.732519329131392]
The KV-Cache technique has become the standard for the inference of large language models (LLMs). This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate into the model. We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens can not only improve efficiency but also enhance performance.
arXiv Detail & Related papers (2024-10-15T05:01:19Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill-stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
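To picture the cascading sub-cache idea, the toy below chains a few fixed-size buffers: when the newest buffer overflows, its oldest entry is either promoted to the next (coarser) buffer or dropped, so older history is retained at decreasing density. The 1-in-2 promotion rate, the buffer sizes, and the use of token ids instead of real KV tensors are illustrative assumptions, not the paper's actual selection policy.

```python
from collections import deque

class CascadingKVCache:
    """Toy cascade of fixed-size buffers (illustrative only)."""

    def __init__(self, num_levels: int = 3, level_size: int = 4):
        self.levels = [deque() for _ in range(num_levels)]
        self.level_size = level_size
        self._tick = 0

    def add(self, token_id: int, level: int = 0):
        if level >= len(self.levels):
            return                      # fell off the last level: evicted for good
        buf = self.levels[level]
        buf.append(token_id)
        if len(buf) > self.level_size:
            oldest = buf.popleft()
            self._tick += 1
            if self._tick % 2 == 0:     # assumed 1-in-2 keep rate when cascading downward
                self.add(oldest, level + 1)

    def retained(self):
        # Everything still cached, across all cascade levels
        return [t for lvl in self.levels for t in lvl]
```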
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant-sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
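The sublinear-memory claim can be pictured with a toy greedy online clustering of key vectors: each incoming key either joins a nearby existing center or opens a new one, so the number of stored representatives can grow much more slowly than the sequence length. The fixed radius and running-mean center update below are assumptions for illustration; the value-sampling component and the paper's error analysis are omitted.

```python
import torch

class OnlineKeyClusterer:
    """Toy greedy online clustering of key vectors (illustrative, not SubGen's algorithm)."""

    def __init__(self, radius: float = 1.0):
        self.radius = radius
        self.centers = []   # representative keys kept in the compressed cache
        self.counts = []    # how many keys each center has absorbed

    def add(self, key: torch.Tensor) -> int:
        """Insert one key; return the index of the center that now represents it."""
        for i, c in enumerate(self.centers):
            if torch.norm(key - c) <= self.radius:
                # Fold the key into an existing cluster via a running mean of its center.
                self.counts[i] += 1
                self.centers[i] = c + (key - c) / self.counts[i]
                return i
        self.centers.append(key.clone())
        self.counts.append(1)
        return len(self.centers) - 1
```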
arXiv Detail & Related papers (2024-02-08T22:17:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.