ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2602.02579v3
- Date: Thu, 05 Feb 2026 03:13:02 GMT
- Title: ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation
- Authors: Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang, Yichen Hao, Xiangyu Zou, Wen Xia, Wentao Zhang, Chongyang Qiu, Pengfei Wang,
- Abstract summary: We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV prioritizes tokens based on their semantic relevance to the user query. Our evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio.
- Score: 22.835149054167122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-calculated KV caches of the RAG documents retrieved for a user query and reprocess selected tokens to recover cross-attention between these pre-calculated KV caches. However, we identify a fundamental "crowding-out effect" in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline to fuse layer-wise attention metrics into a high-utility token set. By ensuring the recomputation budget is dedicated to bridging the informational gap between the retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).
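The selection idea described in the abstract can be illustrated with a small sketch: score each cached document token by the attention mass it receives from the user-query tokens, then spend the recomputation budget only on the top-ranked fraction. The snippet below is a hypothetical, single-head NumPy illustration; the function name, tensor shapes, and the simple softmax scoring are assumptions, and it does not reproduce the paper's dual-stage pipeline or layer-wise metric fusion.

```python
# Minimal sketch (not the authors' implementation) of user-query-driven token
# selection for selective KV recomputation.
import numpy as np

def select_tokens_for_recompute(query_states, doc_keys, budget_ratio=0.2):
    """Rank cached document tokens by the attention mass they receive from the
    user-query tokens and return the indices of the top `budget_ratio` fraction.

    query_states: (num_query_tokens, head_dim) - states of the user-query tokens
    doc_keys:     (num_doc_tokens, head_dim)   - keys from the pre-computed KV cache
    """
    d = query_states.shape[-1]
    # Scaled dot-product scores of every query token against every cached key.
    scores = query_states @ doc_keys.T / np.sqrt(d)            # (Q, N)
    # Softmax over the cached tokens, then sum over query tokens to obtain a
    # per-token relevance estimate with respect to the user query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    relevance = weights.sum(axis=0)                             # (N,)
    k = max(1, int(budget_ratio * doc_keys.shape[0]))
    # Keep the most query-relevant tokens for recomputation, in sequence order.
    top = np.argsort(relevance)[-k:]
    return np.sort(top)

# Toy usage: 4 query tokens, 50 cached document tokens, 20% recomputation budget.
rng = np.random.default_rng(0)
idx = select_tokens_for_recompute(rng.normal(size=(4, 64)),
                                  rng.normal(size=(50, 64)))
print(idx)  # indices of the ~10 tokens selected for recomputation
```

Compared with selecting globally salient tokens, this query-conditioned ranking is what avoids the "crowding-out effect": tokens that matter only for the specific user query can still win a slot in the limited recomputation budget.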
Related papers
- Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction [19.14455067106419]
Current KV cache eviction methods rely on instantaneous metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. We implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV.
arXiv Detail & Related papers (2026-02-09T12:23:38Z) - KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction [20.53279247581787]
We propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. In 2k-length contexts, it requires only 10% of the KV cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy within a 2% accuracy loss.
arXiv Detail & Related papers (2025-12-01T03:59:20Z) - Value-Guided KV Compression for LLMs via Approximated CUR Decomposition [24.262712463465665]
CurDKV is a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output $\mathrm{softmax}(QK^T)V$, ensuring that the retained tokens best preserve the model's predictive behavior.
arXiv Detail & Related papers (2025-09-18T15:04:06Z) - Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z) - Sparse Attention across Multiple-context KV Cache [8.236266965773465]
Reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput by using sparse attention mechanisms to select the most relevant KV Cache. This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache.
arXiv Detail & Related papers (2025-08-06T02:53:14Z) - Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? [79.58770714228983]
Language models handle increasingly long contexts for tasks such as book summarization. This leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings. We propose the *KV footprint* as a unified metric, which accounts for both the amount of KV entries stored and their lifespan in memory.
arXiv Detail & Related papers (2025-06-20T16:21:12Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, activation-aware approach that dynamically determines the probe-query and leverages it to retrieve the relevant KV pairs for inference. Experiments on the LongBench and $\infty$-Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z) - A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization [17.342214950859145]
Long context large language models (LLMs) pose significant challenges for efficient serving due to the large memory footprint and high access overhead of the KV cache. Retrieval-based KV cache reduction methods can mitigate these challenges, typically by offloading the complete KV cache to the CPU and retrieving necessary tokens on demand during inference. This paper proposes A$^2$ATS, a novel retrieval-based KV cache reduction method.
arXiv Detail & Related papers (2025-02-18T09:11:51Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with other approaches.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)