Related papers: CaliDrop: KV Cache Compression with Calibration

CaliDrop: KV Cache Compression with Calibration

URL: http://arxiv.org/abs/2507.19906v2
Date: Mon, 04 Aug 2025 08:19:26 GMT
Title: CaliDrop: KV Cache Compression with Calibration
Authors: Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang,
Abstract summary: Large Language Models (LLMs) require substantial computational resources during generation.<n> KV cache compression techniques have been proposed to mitigate this bottleneck.<n>This paper focuses on enhancing token eviction strategies.
Score: 44.722738059962296
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose \textbf{CaliDrop}, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.

Related papers

MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning [30.527521568636242]
Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs)<n>Existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks.<n>We propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels.
arXiv Detail & Related papers (2025-12-22T09:44:26Z)
G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths.<n> KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning.<n>We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance.
arXiv Detail & Related papers (2025-11-29T14:21:33Z)
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize key-value ( KV) cache to store historical information during sequence processing.<n>Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction.<n>We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z)
LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning [21.761205124793175]
extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache.<n>Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks.<n>We propose LazyEviction, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens' recurrence patterns.
arXiv Detail & Related papers (2025-06-19T02:25:04Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value ( KV) cache.<n>Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers.<n>We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
Accurate KV Cache Quantization with Outlier Tokens Tracing [44.722738059962296]
KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy.<n>Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token.<n>Our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
arXiv Detail & Related papers (2025-05-16T07:23:12Z)
KVCrush: Key value cache size-reduction using similarity in head-behaviour [40.792661186062396]
Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs)<n>However, the memory footprint of the KV is a huge bottleneck for model deployment directly impacting the model's batch size.<n>We propose KVCrush which can be combined with many KV compression technologies to improve the model accuracy at a much smaller memory.
arXiv Detail & Related papers (2025-02-24T02:57:51Z)
AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, which is the first learning-based critical token identification approach.<n> AttentionPredictor accurately predicts the attention score while consuming negligible memory.<n>We also propose a cross-token critical cache prefetching framework that hides the token time overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference.<n>The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately.<n>In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization [31.806112535762367]
Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models(LLMs)
arXiv Detail & Related papers (2024-02-28T06:34:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.