ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2602.02579v3
- Date: Thu, 05 Feb 2026 03:13:02 GMT
- Title: ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation
- Authors: Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang, Yichen Hao, Xiangyu Zou, Wen Xia, Wentao Zhang, Chongyang Qiu, Pengfei Wang,
- Abstract summary: We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV prioritizes tokens based on their semantic relevance to the user query. Our evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio.
- Score: 22.835149054167122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-calculated KV caches of the RAG documents retrieved for a user query and reprocess selected tokens to recover cross-attention between these pre-calculated KV caches. However, we identify a fundamental "crowding-out effect" in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline to fuse layer-wise attention metrics into a high-utility token set. By ensuring the recomputation budget is dedicated to bridging the informational gap between the retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).
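The selection idea described in the abstract can be illustrated with a small sketch: score each cached document token by the attention mass it receives from the user-query tokens, then spend the recomputation budget only on the top-ranked fraction. The snippet below is a hypothetical, single-head NumPy illustration; the function name, tensor shapes, and the simple softmax scoring are assumptions, and it does not reproduce the paper's dual-stage pipeline or layer-wise metric fusion.

```python
# Minimal sketch (not the authors' implementation) of user-query-driven token
# selection for selective KV recomputation.
import numpy as np

def select_tokens_for_recompute(query_states, doc_keys, budget_ratio=0.2):
    """Rank cached document tokens by the attention mass they receive from the
    user-query tokens and return the indices of the top `budget_ratio` fraction.

    query_states: (num_query_tokens, head_dim) - states of the user-query tokens
    doc_keys:     (num_doc_tokens, head_dim)   - keys from the pre-computed KV cache
    """
    d = query_states.shape[-1]
    # Scaled dot-product scores of every query token against every cached key.
    scores = query_states @ doc_keys.T / np.sqrt(d)            # (Q, N)
    # Softmax over the cached tokens, then sum over query tokens to obtain a
    # per-token relevance estimate with respect to the user query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    relevance = weights.sum(axis=0)                             # (N,)
    k = max(1, int(budget_ratio * doc_keys.shape[0]))
    # Keep the most query-relevant tokens for recomputation, in sequence order.
    top = np.argsort(relevance)[-k:]
    return np.sort(top)

# Toy usage: 4 query tokens, 50 cached document tokens, 20% recomputation budget.
rng = np.random.default_rng(0)
idx = select_tokens_for_recompute(rng.normal(size=(4, 64)),
                                  rng.normal(size=(50, 64)))
print(idx)  # indices of the ~10 tokens selected for recomputation
```

Compared with selecting globally salient tokens, this query-conditioned ranking is what avoids the "crowding-out effect": tokens that matter only for the specific user query can still win a slot in the limited recomputation budget.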
Related papers
- Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction [19.14455067106419]
Current KV cache eviction methods rely on instantaneous metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. We implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV.
arXiv Detail & Related papers (2026-02-09T12:23:38Z) - KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction [20.53279247581787]
We propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. In 2k-length contexts, it requires only 10% of the KV cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy within a 2% accuracy loss.
arXiv Detail & Related papers (2025-12-01T03:59:20Z) - Value-Guided KV Compression for LLMs via Approximated CUR Decomposition [24.262712463465665]
CurDKV is a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output $\mathrm{softmax}(QK^T)V$, ensuring that the retained tokens best preserve the model's predictive behavior.
arXiv Detail & Related papers (2025-09-18T15:04:06Z) - Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z) - Sparse Attention across Multiple-context KV Cache [8.236266965773465]
Reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput by using sparse attention mechanisms to select the most relevant KV Cache. This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache.
arXiv Detail & Related papers (2025-08-06T02:53:14Z) - Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? [79.58770714228983]
Language models handle increasingly long contexts for tasks such as book summarization. This leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings. We propose the *KV footprint* as a unified metric, which accounts for both the amount of KV entries stored and their lifespan in memory.
arXiv Detail & Related papers (2025-06-20T16:21:12Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, activation-aware approach that dynamically determines the probe-query and leverages it to retrieve the relevant KV pairs for inference. Experiments on the LongBench and $\infty$-Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z) - A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization [17.342214950859145]
Long context large language models (LLMs) pose significant challenges for efficient serving due to the large memory footprint and high access overhead of the KV cache. Retrieval-based KV cache reduction methods can mitigate these challenges, typically by offloading the complete KV cache to the CPU and retrieving necessary tokens on demand during inference. This paper proposes A$^2$ATS, a novel retrieval-based KV cache reduction method.
arXiv Detail & Related papers (2025-02-18T09:11:51Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with other approaches.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)