R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
- URL: http://arxiv.org/abs/2505.24133v3
- Date: Fri, 13 Jun 2025 21:01:43 GMT
- Title: R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
- Authors: Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu
- Abstract summary: We propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV). R-KV preserves nearly 100% of the full KV cache performance using only 10% of the KV cache. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache.
- Score: 77.84539432982307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV cache reduction also yields a 90% memory saving and a 6.6x throughput improvement over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
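As a concrete illustration of the idea described in the abstract, the following is a minimal sketch of redundancy-aware KV cache eviction. It assumes token importance is approximated by recent attention mass and redundancy by cosine similarity between cached key vectors; the function name `select_kv_budget`, the `alpha` trade-off parameter, and the scoring details are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch of redundancy-aware KV cache eviction (NOT the official
# R-KV code). Assumptions: importance is approximated by attention mass from
# recent query positions; redundancy by cosine similarity of cached keys.
import torch

def select_kv_budget(keys, attn_weights, budget, alpha=0.5):
    """Keep `budget` tokens per head by combining importance and non-redundancy.

    keys:         (num_heads, seq_len, head_dim) cached key vectors
    attn_weights: (num_heads, seq_len) attention mass each cached token
                  received from recent queries
    budget:       number of tokens to retain per head
    alpha:        assumed trade-off between importance and non-redundancy
    """
    # Importance: normalized attention mass per cached token.
    importance = attn_weights / attn_weights.sum(dim=-1, keepdim=True)

    # Redundancy: a token is redundant if its key is similar to other keys.
    k = torch.nn.functional.normalize(keys, dim=-1)
    sim = k @ k.transpose(-1, -2)                 # (H, S, S) cosine similarities
    sim.diagonal(dim1=-2, dim2=-1).zero_()        # ignore self-similarity
    redundancy = sim.clamp(min=0).mean(dim=-1)    # (H, S)

    # Combined score: important and non-redundant tokens rank highest.
    score = alpha * importance + (1 - alpha) * (1 - redundancy)
    keep = score.topk(budget, dim=-1).indices     # (H, budget)
    return keep.sort(dim=-1).values               # preserve original token order
```

The returned per-head indices could then be used with `torch.gather` to retain only the selected key/value entries before the next decoding step.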
Related papers
- ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) have achieved remarkable performance, yet their capability on long-context reasoning is often constrained by the excessive memory required to store the key-value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers or suffer from significant performance degradation under high compression ratios. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction [37.489744618357655]
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries.
arXiv Detail & Related papers (2025-05-29T13:05:47Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point pruning halts. Our method is easy to integrate into LLM inference, not only optimizing memory usage but also reducing inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference [6.222836318380985]
BaKlaVa is a method to allocate optimal memory for individual KV caches across the model. We evaluate our method on the LLaMA-3-8B and Qwen2.5-7B models.
arXiv Detail & Related papers (2025-02-18T04:08:29Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. Mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - Unifying KV Cache Compression for Large Language Models with LeanKV [28.452123478834803]
Large language models (LLMs) exhibit exceptional performance but incur significant serving costs due to their substantial memory requirements. Existing KV cache compression techniques, such as quantization and pruning, apply uniform treatment to both keys and values, and discard unimportant tokens entirely. We introduce LeanKV, a framework that advances KV cache compression by exploiting three levels of differentiation in the KV cache.
arXiv Detail & Related papers (2024-12-04T08:51:23Z) - KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z) - CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios [13.144156413032896]
We introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression.
We show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability.
Our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
arXiv Detail & Related papers (2024-09-16T17:36:50Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [38.732413451399]
PyramidKV is a novel and effective KV cache compression method. We show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension (an illustrative per-layer budget sketch in this spirit appears after this list).
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
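For the PyramidKV entry above, a rough sketch of pyramid-style per-layer budget allocation is shown below: the total cached-token budget is split so that lower layers retain more entries than higher ones. The linear decay schedule, the function name `pyramid_budgets`, and the `min_ratio` parameter are assumptions made for illustration, not the paper's actual allocation rule.

```python
# Illustrative sketch of pyramid-style per-layer KV budget allocation
# (not the official PyramidKV code). Assumes num_layers >= 2.
def pyramid_budgets(total_budget: int, num_layers: int, min_ratio: float = 0.2):
    """Split `total_budget` cached tokens across layers, front-loading lower layers.

    min_ratio: assumed fraction of the top layer's share relative to the
               bottom layer's share.
    """
    # Linearly decreasing weights from 1.0 (layer 0) down to min_ratio (last layer).
    weights = [1.0 - (1.0 - min_ratio) * i / (num_layers - 1) for i in range(num_layers)]
    scale = total_budget / sum(weights)
    return [max(1, round(w * scale)) for w in weights]

# Example: 4096 total cached tokens spread over 32 layers.
print(pyramid_budgets(4096, 32)[:5])  # lower layers receive the largest shares
```

Running the example with 4096 total tokens over 32 layers assigns roughly 213 entries to the bottom layer and about 43 to the top one, giving the decreasing, pyramid-shaped allocation the entry describes.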