EvolKV: Evolutionary KV Cache Compression for LLM Inference
- URL: http://arxiv.org/abs/2509.08315v1
- Date: Wed, 10 Sep 2025 06:32:49 GMT
- Title: EvolKV: Evolutionary KV Cache Compression for LLM Inference
- Authors: Bohan Yu, Yekun Chai
- Abstract summary: EvolKV is an adaptive framework for layer-wise, task-driven KV cache compression. We show EvolKV achieves superior performance over the full KV cache setting on code completion while utilizing only 1.5% of the original budget.
- Score: 16.100469422266045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing key-value (KV) cache compression methods typically rely on heuristics, such as uniform cache allocation across layers or static eviction policies; however, they ignore the critical interplay between layer-specific feature patterns and task performance, which can lead to degraded generalization. In this paper, we propose EvolKV, an adaptive framework for layer-wise, task-driven KV cache compression that jointly optimizes memory efficiency and task performance. By reformulating cache allocation as a multi-objective optimization problem, EvolKV leverages evolutionary search to dynamically configure layer budgets while directly maximizing downstream performance. Extensive experiments on 11 tasks demonstrate that our approach outperforms all baseline methods across a wide range of KV cache budgets on long-context tasks and surpasses heuristic baselines by up to 7 percentage points on GSM8K. Notably, EvolKV achieves superior performance over the full KV cache setting on code completion while utilizing only 1.5% of the original budget, suggesting untapped potential in learned compression strategies for KV cache budget allocation.
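The abstract's core idea, casting layer-wise KV budget allocation as an optimization problem solved by evolutionary search, can be illustrated with a short sketch. The code below is not the authors' implementation; the `evaluate_task` hook, population size, mutation scheme, and budget normalization are all assumptions made for illustration.

```python
# Minimal sketch of evolutionary layer-wise KV budget search (illustrative only).
# evaluate_task(budgets) is a hypothetical stand-in for running the compressed
# model on a validation set and returning a task score.
import random

NUM_LAYERS = 32        # assumed model depth
TOTAL_BUDGET = 2048    # assumed total number of cached tokens across all layers
POP_SIZE, GENERATIONS, MUT_STD = 16, 20, 0.1


def evaluate_task(budgets: list[int]) -> float:
    """Hypothetical: score the model on a downstream task under these per-layer budgets."""
    raise NotImplementedError


def normalize(shares: list[float]) -> list[int]:
    """Rescale a share vector into integer per-layer budgets summing roughly to TOTAL_BUDGET."""
    total = sum(shares)
    return [max(1, round(s * TOTAL_BUDGET / total)) for s in shares]


def mutate(budgets: list[int]) -> list[int]:
    """Perturb each layer's share with Gaussian noise, then re-normalize."""
    noisy = [max(1e-3, b * (1.0 + random.gauss(0.0, MUT_STD))) for b in budgets]
    return normalize(noisy)


def search() -> list[int]:
    # Start from a uniform allocation and evolve toward task-optimal layer budgets.
    population = [normalize([1.0] * NUM_LAYERS) for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=evaluate_task, reverse=True)
        parents = ranked[: POP_SIZE // 2]  # keep the best half
        children = [mutate(random.choice(parents)) for _ in range(POP_SIZE - len(parents))]
        population = parents + children
    return max(population, key=evaluate_task)
```

In this toy form the memory constraint is fixed through TOTAL_BUDGET; the paper frames allocation as a multi-objective problem, so a faithful implementation would also score candidates on cache size rather than only on task performance.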
Related papers
- Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z) - EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving [27.616284276071855]
Reusing KV cache is essential for high efficiency in Large Language Model (LLM) inference systems. Prior work has proposed to either evict KV cache to lower-tier storage devices or compress it so that more of it fits in fast memory. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers.
arXiv Detail & Related papers (2025-12-16T22:21:55Z) - CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing [54.34080239841088]
CommonKV is a training-free method for cross-layer KV cache compression through parameter sharing between adjacent layers. We show that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios.
arXiv Detail & Related papers (2025-08-22T06:55:45Z) - R-KV: Redundancy-aware KV Cache Compression for Reasoning Models [77.84539432982307]
We propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV). R-KV preserves nearly 100% of the full KV cache performance using only 10% of the KV cache. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache.
arXiv Detail & Related papers (2025-05-30T02:03:24Z) - WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference [9.572076809796448]
We propose a novel task-adaptive KV cache window selection method, WindowKV. We show that WindowKV maintains performance comparable to full KV cache retention while using only 12% of the original KV cache. Our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
arXiv Detail & Related papers (2025-03-23T03:36:52Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs [31.62076958302603]
Existing KV cache compression methods enforce a fixed pattern, neglecting task-specific characteristics and reducing the retention of essential information. We propose DynamicKV, a method that dynamically optimizes token retention by adjusting the number of tokens retained at each layer to adapt to the specific task. Our method retains only 1.7% of the KV cache size while achieving 85% of the full KV cache performance on LongBench.
arXiv Detail & Related papers (2024-12-19T13:28:42Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with context length.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [37.94892570127548]
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime. We propose Ada-KV, the first head-wise adaptive budget allocation strategy.
arXiv Detail & Related papers (2024-07-16T09:53:32Z) - PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [38.732413451399]
PyramidKV is a novel and effective KV cache compression method. We show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension; a toy illustration of pyramid-shaped layer allocation is sketched after this list.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
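To give a concrete feel for the layer-wise budget shaping that PyramidKV's title alludes to, here is a minimal illustrative allocation. The linear taper, the 4:1 bottom-to-top ratio, and the helper name are assumptions for illustration, not the paper's formula.

```python
# Illustrative only: a pyramid-shaped per-layer KV budget, giving lower layers
# a larger share of a fixed total budget and upper layers a smaller one.
def pyramid_budgets(num_layers: int, total_budget: int, ratio: float = 4.0) -> list[int]:
    """Taper budgets linearly from ratio*base at layer 0 down to base at the top
    layer, then rescale so they sum to total_budget."""
    raw = [ratio - (ratio - 1.0) * layer / (num_layers - 1) for layer in range(num_layers)]
    scale = total_budget / sum(raw)
    budgets = [max(1, round(r * scale)) for r in raw]
    budgets[0] += total_budget - sum(budgets)  # absorb rounding drift in the bottom layer
    return budgets


# Example: 8 layers sharing 1024 cached tokens, bottom-heavy allocation.
print(pyramid_budgets(num_layers=8, total_budget=1024))
# -> [205, 183, 161, 139, 117, 95, 73, 51]
```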