LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
- URL: http://arxiv.org/abs/2509.09754v1
- Date: Thu, 11 Sep 2025 16:48:24 GMT
- Title: LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
- Authors: Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu,
- Abstract summary: KV Cache is commonly used to accelerate LLM inference with long contexts.<n>Existing compression methods, however, are largely and lack dynamic budget allocation.<n>We introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams.
- Score: 24.45300622331682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava.
Related papers
- HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference [14.17979669446161]
We propose HeteroCache, a training-free dynamic compression framework.<n>We show that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3times$ compared to the original model in the 224K context.
arXiv Detail & Related papers (2026-01-20T07:35:06Z) - LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models.<n>LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value ( KV) cache.<n>Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers.<n>We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query [48.52389201779425]
KV cache memory usage grows substantially with longer text sequences.<n>Existing KV cache eviction methods prune tokens using prefilling-stage attention scores.<n>Lookahead Q-Cache generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries.
arXiv Detail & Related papers (2025-05-24T10:34:38Z) - dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.<n>We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.<n>Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z) - ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty [35.947737679664016]
As the inference length increases, growing KV caches might lead to out-of-memory issues.<n>This paper proposes a simple yet effective KV cache compression method that leverages layer uncertainty to allocate budget size for each layer.<n> Experimental results show that the proposed method can reduce memory usage of the KV caches to only $sim$20% when compared to Full KV inference.
arXiv Detail & Related papers (2024-12-12T07:52:56Z) - EvoPress: Accurate Dynamic Model Compression via Evolutionary Search [33.86918407429272]
EvoPress is an evolutionary framework for dynamic compression of large language models.<n>It identifies optimal compression profiles in a highly efficient manner.<n>We set new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
arXiv Detail & Related papers (2024-10-18T17:46:37Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - LoMA: Lossless Compressed Memory Attention [0.0]
Lossless Compressed Memory Attention (LoMA) is a novel approach to reduce memory and computational demands during autoregressive generation.
LoMA incorporates a specialized training or fine-tuning precedure alongside an autoregressive generation algorithm optimized for the compressed context.
Experimental validation has demonstrated that LoMA significantly reducing computational consumption and memory usage.
arXiv Detail & Related papers (2024-01-16T09:18:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.