WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
- URL: http://arxiv.org/abs/2503.17922v2
- Date: Thu, 27 Mar 2025 14:11:37 GMT
- Title: WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
- Authors: Youhui Zuo, Sibo Wei, Chen Zhang, Zhuorui Liu, Wenpeng Lu, Dawei Song
- Abstract summary: We propose a novel task-adaptive KV cache window selection method, WindowKV. We show that WindowKV maintains performance comparable to full KV cache retention while using only 12% of the original KV cache. Our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
- Score: 9.572076809796448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become one of the foundational components. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and accounting for task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens according to task-specific characteristics, ensuring the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
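The abstract describes two mechanisms: keeping high-scoring windows of consecutive tokens, and sharing the selected cache indices across layers within a group. The sketch below illustrates that general idea in a deliberately simplified form; the window size, group size, scoring by aggregated attention mass, and all function names are assumptions for illustration, not the paper's actual algorithm.

```python
# Minimal sketch (assumed details, not WindowKV itself): score fixed-size
# windows of consecutive prefill tokens by attention mass, keep the top
# windows, and reuse one layer's selected indices for all layers in its group.
import numpy as np

def select_window_indices(attn_scores, window_size=32, keep_windows=4):
    """Pick the highest-scoring windows of consecutive token positions.

    attn_scores: (seq_len,) aggregated attention mass each prefill token
                 receives from recent query tokens (assumed importance proxy).
    Returns sorted token indices to retain in the KV cache.
    """
    seq_len = attn_scores.shape[0]
    n_windows = seq_len // window_size
    # Score each window by the total attention mass of its tokens.
    window_scores = attn_scores[: n_windows * window_size].reshape(
        n_windows, window_size
    ).sum(axis=1)
    top = np.argsort(window_scores)[-keep_windows:]
    kept = np.concatenate(
        [np.arange(w * window_size, (w + 1) * window_size) for w in top]
    )
    return np.sort(kept)

def compress_kv_cache(keys, values, attn_scores, group_size=4):
    """Apply one layer's window selection to every layer in its group.

    keys/values: (n_layers, seq_len, head_dim) toy per-layer caches.
    The first layer of each consecutive group computes the indices and the
    remaining layers reuse them (the index-sharing idea, simplified).
    """
    n_layers = keys.shape[0]
    compressed = []
    for g_start in range(0, n_layers, group_size):
        idx = select_window_indices(attn_scores[g_start])
        for layer in range(g_start, min(g_start + group_size, n_layers)):
            compressed.append((keys[layer][idx], values[layer][idx]))
    return compressed

# Toy usage: 8 layers, 256 prefill tokens, head_dim 64.
rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 256, 64))
values = rng.standard_normal((8, 256, 64))
attn_scores = rng.random((8, 256))
cache = compress_kv_cache(keys, values, attn_scores)
print(len(cache), cache[0][0].shape)  # 8 layer caches, each keeping (128, 64)
```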
Related papers
- KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference [16.53643930310808]
KeepKV is a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints.
We show that KeepKV substantially reduces memory usage, improves inference throughput by more than 2x, and maintains superior generation quality even with a 10% KV cache budget.
arXiv Detail & Related papers (2025-04-14T06:58:00Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point the pruning process halts. Our method is easy to integrate into LLM inference, not only optimizing memory space but also reducing inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference.
The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension in isolation.
In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n²) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache [17.58398289266989]
MiniKV is a KV cache optimization method that preserves long-context task accuracy while significantly reducing KV cache size. We show that MiniKV achieves an 86% KV cache compression ratio while recovering over 98.5% of accuracy, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2024-11-27T06:10:49Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
arXiv Detail & Related papers (2024-07-11T12:50:42Z) - PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
We investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing.
Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention is scattered widely in the lower layers.
Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method (see the illustrative layer-budget sketch after this list).
arXiv Detail & Related papers (2024-06-04T07:51:30Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
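The PyramidKV entry above attributes its design to attention spreading widely in lower layers and concentrating in higher ones. Below is a minimal, hypothetical sketch of how a decreasing per-layer cache budget could be scheduled under that observation; the linear schedule, the top_ratio parameter, and all numbers are assumptions for illustration, not the paper's actual allocation rule.

```python
# Illustrative sketch only: a "pyramidal" split of a total KV cache token
# budget across layers, decreasing from lower to higher layers. Assumed
# linear schedule, not PyramidKV's published recipe.
def pyramidal_budgets(n_layers: int, total_budget: int,
                      top_ratio: float = 0.2) -> list[int]:
    """Split total_budget across layers, decreasing linearly.

    top_ratio: fraction of the per-layer average given to the last (highest)
    layer; the first layer receives the symmetric surplus, so the total is
    preserved.
    """
    avg = total_budget / n_layers
    low, high = avg * (2 - top_ratio), avg * top_ratio  # first vs last layer
    budgets = []
    for layer in range(n_layers):
        frac = layer / max(n_layers - 1, 1)
        budgets.append(round(low + frac * (high - low)))
    return budgets

print(pyramidal_budgets(n_layers=8, total_budget=1024))
# -> [230, 201, 172, 143, 113, 84, 55, 26]  (decreasing, sums to 1024)
```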