Adaptive KV-Cache Compression without Manually Setting Budget
- URL: http://arxiv.org/abs/2509.03136v1
- Date: Wed, 03 Sep 2025 08:38:40 GMT
- Title: Adaptive KV-Cache Compression without Manually Setting Budget
- Authors: Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang,
- Abstract summary: Large language models (LLMs) inference relies heavily on KV-caches to accelerate autoregressive decoding.<n>Current KV-cache compression methods suffer from a Procrustes' bed problem.<n>We present GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification.
- Score: 30.469232780086532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) inference relies heavily on KV-caches to accelerate autoregressive decoding, but the resulting memory footprint grows rapidly with sequence length, posing significant efficiency challenges. Current KV-cache compression methods suffer from a Procrustes' bed problem: they force diverse workloads into fixed compression ratios, leading to suboptimal resource allocation and inference performance. To this end, we present GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification while achieving superior accuracy-efficiency trade-offs. GVote operates on the principle that the important keys are the aggregation of keys required by future queries. The method predicts future query attention demands by Monte-Carlo style sampling potential queries and aggregating selected keys to determine the optimal cache budget without manual specification. Experimental evaluation demonstrates GVote's effectiveness across multiple benchmarks, including GSM8K, RULER and Longbench. Compared to baselines, GVote exhibits 2$\times$ memory reduction while the accuracy maintains higher or comparable.
Related papers
- KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity [6.542188603141656]
Key-Value cache is central to the efficiency of large language models.<n>As sequence length and batch size grow, the cache becomes a major memory bottleneck.<n>We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix.
arXiv Detail & Related papers (2025-12-05T17:51:10Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value ( KV) cache.<n>Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers.<n>We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.<n>It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance.<n>Our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference.<n>We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence.<n>Our method achieves the state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency.<n>We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters.<n>Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [37.94892570127548]
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache.<n>Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime.<n>We propose Ada-KV, the first head-wise adaptive budget allocation strategy.
arXiv Detail & Related papers (2024-07-16T09:53:32Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.