The Pitfalls of KV Cache Compression
- URL: http://arxiv.org/abs/2510.00231v1
- Date: Tue, 30 Sep 2025 19:55:26 GMT
- Title: The Pitfalls of KV Cache Compression
- Authors: Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel,
- Abstract summary: We show that certain instructions degrade much more rapidly with compression.<n>We show several factors that play a role in prompt leakage: compression method, instruction order, and KV eviction bias.
- Score: 52.196873305708955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls practitioners should be aware of when deploying KV cache compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example of that, we highlight system prompt leakage as a case study, empirically showing the impact of compression on leakage and general instruction following. We show several factors that play a role in prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.
Related papers
- More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression [0.0]
LASER-KV is a framework designed to test the limits of KV compression under a strict accumulative budgeting policy.<n>Experiments on the Babilong benchmark reveal performance degradation in previous compression methods by 15-30% on various long context tasks.
arXiv Detail & Related papers (2026-02-02T15:05:03Z) - Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.<n>We first show that existing prompt compression methods are ineffective for many-shot compression.<n>We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values.<n>For Keys, we propose Similarity aware Recontext (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation.<n>For Values, we propose Offline Head-wise Value (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - KV-Distill: Nearly Lossless Learnable Context Compression for LLMs [37.0803484148612]
We introduce KV-Distill, a Transformer compression framework that distills long context KV caches into significantly shorter representations.<n> KV-Distill can be trained as a parameter-efficient adaptor for pretrained models.<n>It can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance.
arXiv Detail & Related papers (2025-03-13T13:15:28Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.<n>It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance.<n>Our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - Can LLMs Maintain Fundamental Abilities under KV Cache Compression? [29.510433427184385]
We present a benchmark KVFundaBench to evaluate the effects of KV cache compression across diverse fundamental language models.<n>We propose ShotKV, a novel compression approach that handles prefill and decoding phases while maintaining shot-level semantic coherence.
arXiv Detail & Related papers (2025-02-04T02:23:06Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - UNComp: Can Matrix Entropy Uncover Sparsity? -- A Compressor Design from an Uncertainty-Aware Perspective [85.08718140718707]
UNComp is an uncertainty-aware framework that uncovers sparsity patterns that can be used for adaptive compression.<n>By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4x.
arXiv Detail & Related papers (2024-10-04T02:32:36Z) - Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression [33.45167213570976]
We investigate the impact of model compression on four dimensions: (1) degeneration harm, i.e., bias and toxicity in generation; (2) representational harm, i.e., biases in discriminative tasks; (3) dialect bias; and(4) language modeling and downstream task performance.
Our analysis reveals that compression can lead to unexpected consequences.
arXiv Detail & Related papers (2024-07-06T05:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.