Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
- URL: http://arxiv.org/abs/2505.20334v1
- Date: Sat, 24 May 2025 10:34:38 GMT
- Title: Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
- Authors: Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
- Abstract summary: KV cache memory usage grows substantially with longer text sequences. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores. Lookahead Q-Cache generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries.
- Score: 48.52389201779425
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 $\sim$ 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
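The core idea — score each cached token by how much attention it would receive from approximate future (lookahead) queries, then keep only the top-scoring tokens plus a recent window — can be sketched as follows. This is a minimal single-head sketch based only on the abstract: the function name, the sum-over-queries aggregation, and the always-kept recent window are illustrative assumptions, and the paper's actual construction of the pseudo lookahead queries is not reproduced here (any `[m, head_dim]` tensor can stand in for them).

```python
import torch

def evict_kv_with_pseudo_queries(keys, values, pseudo_queries, budget, recent=32):
    """Score prefill KV entries with lookahead (pseudo) queries; keep the top `budget`.

    keys, values:    [seq_len, head_dim]  prefill-stage KV cache for one head
    pseudo_queries:  [m, head_dim]        low-cost stand-ins for future decoding queries
    budget:          total number of KV entries to retain
    recent:          number of most recent tokens that are always kept
    """
    seq_len, head_dim = keys.shape
    if budget >= seq_len:
        return keys, values, torch.arange(seq_len)
    assert recent < budget, "budget must leave room beyond the recent window"

    # Attention of the lookahead observation window over the cached keys.
    attn = torch.softmax(pseudo_queries @ keys.T / head_dim ** 0.5, dim=-1)  # [m, seq_len]

    # Importance of each cached token = accumulated attention it receives
    # from the pseudo lookahead queries (aggregation choice is an assumption).
    importance = attn.sum(dim=0)                                             # [seq_len]

    # Always keep the most recent tokens; fill the rest of the budget by importance.
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[-recent:] = True
    remaining = budget - int(keep.sum())
    candidates = importance.masked_fill(keep, float("-inf"))
    keep[candidates.topk(remaining).indices] = True

    idx = keep.nonzero(as_tuple=True)[0]
    return keys[idx], values[idx], idx
```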
Related papers
- CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences [36.05521425453999]
Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. We introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem". CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache sizes for layers accordingly, and manages memory constraints in a cascading manner.
arXiv Detail & Related papers (2025-03-16T12:49:44Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization [17.342214950859145]
Long context large language models (LLMs) pose significant challenges for efficient serving due to the large memory footprint and high access overhead of the KV cache. Retrieval-based KV cache reduction methods can mitigate these challenges, typically by offloading the complete KV cache to CPU and retrieving the necessary tokens on demand during inference. This paper proposes A$^2$ATS, a novel retrieval-based KV cache reduction method.
arXiv Detail & Related papers (2025-02-18T09:11:51Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-$O(n)$ memory methods suffer in multi-turn scenarios, while sparse encoding with $O(n)$ memory and sub-$O(n^2)$ pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency. We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. Experimental results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels (see the channel-pruning sketch after this list).
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - PQCache: Product Quantization-based KVCache for Long Context LLM Inference [27.523568511043273]
Key-Value Cache (KVCache) stores the intermediate representations of tokens within Large Language Models (LLMs). We propose PQCache, which employs Product Quantization (PQ) to manage the KVCache, maintaining model quality while ensuring low serving latency (see the product-quantization sketch after this list). PQCache achieves both effectiveness and efficiency, with a 4.60% score improvement over existing methods on InfiniteBench.
arXiv Detail & Related papers (2024-07-01T13:05:42Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [13.041210267981613]
CachedAttention is a new attention mechanism that enables reuse of KV caches across multi-turn conversations.
It significantly decreases the time to first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8$\times$ for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
arXiv Detail & Related papers (2024-03-23T10:42:49Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
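For the ThinK entry above, a query-driven channel-pruning step could look roughly like the following. This is a simplified sketch, not the paper's algorithm: the channel score (mean query magnitude times key-channel norm) and the `keep_ratio` parameter are illustrative assumptions chosen to show what "pruning the least significant channels" of the key cache means in practice.

```python
import torch

def prune_key_channels(keys, queries, keep_ratio=0.5):
    """Keep only the key-cache channels that contribute most to the logits Q·K^T.

    keys:    [seq_len, head_dim]   cached keys for one head
    queries: [num_obs, head_dim]   observed queries used to score channels
    """
    head_dim = keys.shape[-1]
    num_keep = max(1, int(head_dim * keep_ratio))

    # Channels whose (query magnitude x key-channel norm) product is small
    # contribute little to Q·K^T, so dropping them perturbs attention weights least.
    channel_score = queries.abs().mean(dim=0) * keys.norm(dim=0)   # [head_dim]

    keep_idx = channel_score.topk(num_keep).indices.sort().values
    return keys[:, keep_idx], keep_idx
```

At decode time the query would have to be sliced with the same `keep_idx` before computing `q[keep_idx] @ pruned_keys.T`, so that pruned channels are dropped consistently on both sides.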
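For the PQCache entry, the product-quantization machinery it builds on can be sketched as below: keys are split into sub-vectors, each sub-vector space is clustered with k-means, and at decode time approximate query-key dot products are assembled from per-centroid lookup tables to retrieve the most relevant tokens. Function names, the sub-vector/centroid counts, and the scikit-learn k-means choice are illustrative assumptions; the paper's system-level concerns (CPU offloading, latency hiding) are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pq_codebooks(keys, num_subvectors=4, num_centroids=16):
    """Train one k-means codebook per key sub-vector (standard product quantization)."""
    seq_len, head_dim = keys.shape
    sub_dim = head_dim // num_subvectors
    codebooks, codes = [], []
    for s in range(num_subvectors):
        sub = keys[:, s * sub_dim:(s + 1) * sub_dim]
        km = KMeans(n_clusters=num_centroids, n_init=4, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_)       # [num_centroids, sub_dim]
        codes.append(km.labels_)                    # [seq_len] centroid id per token
    return codebooks, np.stack(codes, axis=1)       # codes: [seq_len, num_subvectors]

def retrieve_top_tokens(query, codebooks, codes, top_k=64):
    """Approximate q·k for every cached token via lookup tables; return top_k token ids."""
    sub_dim = query.shape[0] // len(codebooks)
    scores = np.zeros(codes.shape[0])
    for s, cb in enumerate(codebooks):
        table = cb @ query[s * sub_dim:(s + 1) * sub_dim]   # [num_centroids] partial dots
        scores += table[codes[:, s]]                        # gather each token's contribution
    return np.argsort(-scores)[:top_k]
```

The retrieved token indices would then be used to fetch only those KV entries for exact attention, which is what makes retrieval-based KV reduction attractive when the full cache lives off-GPU.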