Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
- URL: http://arxiv.org/abs/2510.08525v1
- Date: Thu, 09 Oct 2025 17:50:00 GMT
- Title: Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
- Authors: Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang
- Abstract summary: Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation. Existing KV cache compression methods underperform on reasoning models. We propose RLKV, a novel reasoning-critical head identification framework.
- Score: 15.865990296257413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near-lossless performance compared to uncompressed results.
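The allocation described in the abstract (full KV cache for reasoning-critical heads, a constant-size compressed cache for the rest) can be illustrated with a rough memory estimate. The sketch below is not the authors' implementation: `CacheConfig`, `kv_cache_bytes`, `cache_reduction`, the model dimensions, and the `const_tokens` budget are all hypothetical names and values chosen for illustration, and it assumes each compressed head simply stores a fixed number of tokens.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    """Hypothetical model dimensions used only for this estimate."""
    num_layers: int
    num_kv_heads: int        # KV heads per layer
    head_dim: int
    bytes_per_elem: int = 2  # assumes fp16/bf16 storage


def kv_cache_bytes(cfg: CacheConfig, seq_len: int,
                   full_head_fraction: float, const_tokens: int) -> int:
    """Estimate KV cache size when only some heads keep the full sequence.

    Heads outside the full-cache set are assumed to hold at most
    `const_tokens` entries (an illustrative constant budget).
    """
    total_heads = cfg.num_layers * cfg.num_kv_heads
    full_heads = round(total_heads * full_head_fraction)
    compressed_heads = total_heads - full_heads
    # Each cached token stores one key vector and one value vector per head.
    per_token = 2 * cfg.head_dim * cfg.bytes_per_elem
    full_bytes = full_heads * seq_len * per_token
    compressed_bytes = compressed_heads * min(seq_len, const_tokens) * per_token
    return full_bytes + compressed_bytes


def cache_reduction(cfg: CacheConfig, seq_len: int,
                    full_head_fraction: float, const_tokens: int) -> float:
    """Fraction of KV cache memory saved relative to keeping every head full."""
    baseline = kv_cache_bytes(cfg, seq_len, 1.0, const_tokens)
    mixed = kv_cache_bytes(cfg, seq_len, full_head_fraction, const_tokens)
    return 1.0 - mixed / baseline


if __name__ == "__main__":
    cfg = CacheConfig(num_layers=32, num_kv_heads=8, head_dim=128)
    # Half of the heads keep the full 16384-token cache; the rest keep 256 tokens.
    print(f"{cache_reduction(cfg, 16384, 0.5, 256):.1%}")  # prints 49.2%
```

Under these assumptions the saving approaches the fraction of compressed heads as the sequence grows, since the constant budget becomes negligible relative to the full-length cache.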
Related papers
- G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance.
arXiv Detail & Related papers (2025-11-29T14:21:33Z)
- Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons [22.085345397844687]
We propose to periodically compress the generation KV cache with a learned, special-purpose token. We train the model to perform this compression via a modified joint distillation and reinforcement learning framework. Our method achieves a superior memory-accuracy frontier compared to both the model without cache compression and training-free compression techniques.
arXiv Detail & Related papers (2025-10-15T17:57:21Z)
- ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models [13.284627477293322]
ThinKV is a thought-adaptive KV cache compression framework. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance. Experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache.
arXiv Detail & Related papers (2025-10-01T04:09:02Z)
- CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing [54.34080239841088]
CommonKV is a training-free method for cross-layer KV cache compression through adjacent parameter sharing. We show that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios.
arXiv Detail & Related papers (2025-08-22T06:55:45Z)
- ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation. For Values, we propose Offline Head-wise Value (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
- R-KV: Redundancy-aware KV Cache Compression for Reasoning Models [77.84539432982307]
We propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV). R-KV preserves nearly 100% of the full KV cache performance using only 10% of the KV cache. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache.
arXiv Detail & Related papers (2025-05-30T02:03:24Z)
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with other methods.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
- Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning [19.942402563256962]
Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs).
We propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression.
Our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark.
arXiv Detail & Related papers (2024-10-25T02:22:00Z)
- RazorAttention: Efficient KV Cache Compression Through Retrieval Heads [11.708388082001074]
We propose a novel compression technique for Key-Value cache that preserves all token information.
RazorAttention achieves a reduction in KV cache size by over 70% without noticeable impacts on performance.
arXiv Detail & Related papers (2024-07-22T01:12:23Z)
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [82.08922896531618]
We introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs).
We conduct targeted profiling to discern the intrinsic structure of attention modules.
Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens.
arXiv Detail & Related papers (2023-10-03T05:17:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.