Sparse Attention across Multiple-context KV Cache
- URL: http://arxiv.org/abs/2508.11661v1
- Date: Wed, 06 Aug 2025 02:53:14 GMT
- Title: Sparse Attention across Multiple-context KV Cache
- Authors: Ziyi Cao, Qingyi Si, Jingbin Zhang, Bingquan Liu
- Abstract summary: Reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput via sparse attention mechanisms that select the most relevant KV Cache. This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache.
- Score: 8.236266965773465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models face significant cost challenges in long-sequence inference. To address this, reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput via sparse attention mechanisms that select the most relevant KV Cache, thereby reducing sequence length. However, such techniques are limited to single-context scenarios, where historical KV Cache is computed sequentially with causal-attention dependencies. In retrieval-augmented generation (RAG) scenarios, where the retrieved documents serving as context are unknown beforehand, each document's KV Cache is computed and stored independently (termed multiple-context KV Cache), lacking cross-attention between contexts. This renders existing methods ineffective. Although prior work partially recomputes multiple-context KV Cache to mitigate the accuracy loss from missing cross-attention, it requires retaining all KV Cache throughout, failing to reduce memory overhead. This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache. Specifically, SamKV takes into account the complementary information of other contexts when sparsifying one context, and then locally recomputes the sparsified information. Experiments demonstrate that our method compresses sequence length to 15% without accuracy degradation compared with full-recomputation baselines, significantly boosting throughput in multi-context RAG scenarios.
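A minimal, hypothetical PyTorch sketch of this idea is shown below. The function name, the single summary query standing in for cross-context relevance, and the scoring heuristic are illustrative assumptions, not the authors' released implementation; in SamKV the retained tokens would additionally be locally recomputed to restore cross-context attention, which the sketch only notes in a comment.

```python
import torch

def sparsify_multi_context_kv(kv_caches, query_summary, keep_ratio=0.15):
    """Keep only the most relevant tokens from each independently cached context.

    kv_caches: list of (K, V) pairs, each of shape [seq_i, d], computed per
        retrieved document without cross-attention to the other documents.
    query_summary: [d] vector standing in for the signal used to judge
        relevance (in SamKV this also reflects the complementary information
        of the other contexts; here it is a single placeholder vector).
    """
    kept = []
    for k, v in kv_caches:
        # Relevance score per cached token: scaled dot product with the summary.
        scores = torch.softmax(k @ query_summary / k.shape[-1] ** 0.5, dim=0)
        n_keep = max(1, int(keep_ratio * k.shape[0]))
        _, idx = torch.topk(scores, n_keep)
        idx, _ = torch.sort(idx)  # restore original token order within the context
        kept.append((k[idx], v[idx]))
    # SamKV would now locally recompute the retained tokens so that
    # cross-context attention is restored; this sketch only concatenates them.
    k_sparse = torch.cat([k for k, _ in kept], dim=0)
    v_sparse = torch.cat([v for _, v in kept], dim=0)
    return k_sparse, v_sparse

# Toy usage: three retrieved documents with independently computed caches.
d = 64
caches = [(torch.randn(n, d), torch.randn(n, d)) for n in (120, 80, 200)]
query = torch.randn(d)
k_sparse, v_sparse = sparsify_multi_context_kv(caches, query)
print(k_sparse.shape)  # roughly 15% of the 400 original tokens are kept
```

With a 15% keep ratio, the concatenated sparse cache retains roughly 15% of the tokens across all retrieved documents, matching the compression level reported in the abstract.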
Related papers
- KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction [20.53279247581787]
We propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. In 2k-length contexts, it requires only 10% of the KV Cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy with at most a 2% accuracy loss.
arXiv Detail & Related papers (2025-12-01T03:59:20Z) - G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance (a generic sketch of this scoring-and-eviction pattern appears after this list).
arXiv Detail & Related papers (2025-11-29T14:21:33Z) - Retrospective Sparse Attention for Efficient Long-Context Generation [5.562294018150909]
RetroAttention retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Experiments show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods.
arXiv Detail & Related papers (2025-08-12T15:11:47Z) - KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction [37.489744618357655]
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries.
arXiv Detail & Related papers (2025-05-29T13:05:47Z) - FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management [27.734106884226005]
FlowKV is a novel multi-turn isolation mechanism for KV Cache management. It preserves the accumulated compressed KV cache from past turns. It prevents the re-compression of older context, thereby mitigating catastrophic forgetting.
arXiv Detail & Related papers (2025-05-21T10:20:46Z) - KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference [16.53643930310808]
KeepKV is a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. We show that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x, and maintains superior generation quality even with 10% KV cache budgets.
arXiv Detail & Related papers (2025-04-14T06:58:00Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse [35.97391418064724]
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods.
arXiv Detail & Related papers (2025-02-21T23:34:29Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with other approaches.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
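Several of the methods listed above (e.g., G-KV, DBudgetKV, KVzip, ThinK) share a common pattern: rank cached entries by an attention-derived importance score and evict the lowest-ranked ones until a memory budget is met. The sketch below illustrates only that generic pattern under stated assumptions; the blend of local and historical scores loosely follows the G-KV description, while the function name, weighting, and budget are illustrative rather than any paper's exact algorithm.

```python
import torch

def evict_kv_by_global_score(keys, values, attn_history, attn_local,
                             budget, history_weight=0.5):
    """Score-based KV cache eviction for one attention head.

    keys, values: [seq, d] cached entries.
    attn_history: [seq] attention mass each token has accumulated so far.
    attn_local:   [seq] attention from the most recent decoding steps.
    budget: number of tokens to retain after eviction.
    The blend of historical and local scores loosely mirrors the global
    scoring idea described for G-KV; the weighting here is a placeholder.
    """
    scores = history_weight * attn_history + (1.0 - history_weight) * attn_local
    budget = min(budget, keys.shape[0])
    _, keep = torch.topk(scores, budget)
    keep, _ = torch.sort(keep)  # keep retained tokens in their original order
    return keys[keep], values[keep]

# Toy usage with random attention statistics.
seq_len, d = 512, 64
k, v = torch.randn(seq_len, d), torch.randn(seq_len, d)
history = torch.rand(seq_len)
local = torch.rand(seq_len)
k_small, v_small = evict_kv_by_global_score(k, v, history, local, budget=128)
print(k_small.shape)  # torch.Size([128, 64])
```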