Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference
- URL: http://arxiv.org/abs/2602.08329v1
- Date: Mon, 09 Feb 2026 07:05:23 GMT
- Title: Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference
- Authors: Yifei Gao, Lei Wang, Rong-Cheng Tu, Qixin Zhang, Jun Cheng, Dacheng Tao
- Abstract summary: We propose Pre-hoc Sparsity (PrHS), which selects KV entries before attention scoring and provides explicit accuracy control. PrHS reduces retrieval overhead by over 90%, achieving 3x higher retrieval sparsity than HShare at matched or better accuracy. It incurs under 1% average degradation on LongBench, lowers attention FLOPs by about 15% versus prior baselines, and yields a 9.9x speedup in attention-operator latency and 2.8x higher throughput.
- Score: 54.467557491325046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A core bottleneck in large language model (LLM) inference is the cost of attending over the ever-growing key-value (KV) cache. Although near-oracle top-k KV selection can preserve the quality of dense attention while sharply reducing computation and bandwidth, existing sparse methods generally rely on posterior heuristics, i.e., selectors conditioned on observed attention or proxy scores. Such conditioning introduces posterior bias: it tends to distort true token importance and miss salient tokens, thereby impairing long-range reasoning. To tackle this problem, we propose Pre-hoc Sparsity (PrHS), which selects KV entries before attention scoring and provides explicit accuracy control. Let the attention mass of discarded entries be delta (the dropped mass). Through a marginal-to-mutual-information analysis, we derive an upper bound on the mutual-information loss that depends only on the dropped mass. This relation explains failure modes of posterior heuristics and enables verifiable guarantees by controlling the dropped mass in advance. Within PrHS, we instantiate three orthogonal pre-hoc selectors along the axes of time, depth, and layer. Extensive experiments on LLaMA and Mistral families validate PrHS. Across GSM8K and CoQA, PrHS reduces retrieval overhead by over 90%, achieving 3x higher retrieval sparsity than HShare at matched or better accuracy. It incurs under 1% average degradation on LongBench, lowers attention FLOPs by about 15% versus prior sparse baselines, and yields a 9.9x speedup in attention-operator latency and 2.8x higher throughput on NVIDIA A100-80GB GPUs relative to the dense baseline.
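The dropped-mass control at the heart of PrHS can be illustrated with a short sketch: given pre-hoc per-token importance estimates, keep the smallest set of KV entries whose estimated attention mass reaches 1 - delta, so the dropped mass is capped before any attention scores are computed. The exponential `importance` proxy below is an illustrative stand-in; the paper derives its estimates from dedicated time-, depth-, and layer-axis selectors.

```python
import numpy as np

def prehoc_select(importance: np.ndarray, delta: float) -> np.ndarray:
    """Keep the fewest KV entries whose estimated attention mass covers 1 - delta."""
    mass = importance / importance.sum()        # normalize proxy into a mass
    order = np.argsort(mass)[::-1]              # most important first
    cum = np.cumsum(mass[order])
    k = int(np.searchsorted(cum, 1.0 - delta)) + 1
    return np.sort(order[:k])                   # indices of retained entries

# Toy proxy: importance decaying with distance from the current token
# (an illustrative assumption, not the paper's actual selectors).
importance = np.exp(-0.3 * np.arange(16))[::-1]
keep = prehoc_select(importance, delta=0.05)
print(f"keep {len(keep)}/16 entries, estimated dropped mass <= 0.05:", keep)
```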
Related papers
- QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill [5.014026212750645]
We present QUOKA: query-oriented KV selection for efficient attention. We show that QUOKA achieves near-baseline accuracy while using 88% fewer key-value pairs per attention evaluation.
arXiv Detail & Related papers (2026-02-09T14:32:26Z)
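The summary above names the general pattern of query-oriented selection during prefill; a minimal sketch of that pattern (not QUOKA's specific selector) scores the visible KV entries against each chunk of prefill queries and attends only to the top-k:

```python
import numpy as np

def chunked_topk_prefill(Q, K, V, chunk=64, k=128):
    """Per prefill chunk, attend only to the k KV entries most relevant to it."""
    outs = []
    for s in range(0, len(Q), chunk):
        q = Q[s:s + chunk]
        kv_len = s + len(q)                           # chunk sees earlier tokens only
        rel = (q @ K[:kv_len].T).max(axis=0)          # per-entry relevance to the chunk
        idx = np.argsort(rel)[::-1][:k]               # query-oriented top-k
        logits = q @ K[idx].T / np.sqrt(K.shape[1])   # within-chunk causal mask omitted
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        outs.append((p / p.sum(axis=1, keepdims=True)) @ V[idx])
    return np.vstack(outs)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
print(chunked_topk_prefill(Q, K, V).shape)  # (512, 64)
```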
- Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction [19.14455067106419]
Current KV cache eviction methods rely on instantaneous metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. In this paper, we propose that optimal budget allocation should be governed by the marginal utility of preserving long-term semantic information. We implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV.
arXiv Detail & Related papers (2026-02-09T12:23:38Z)
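A hedged sketch of allocation by marginal utility: given per-head utility curves (here synthetic concave stand-ins for the paper's offline-profiled utilities), greedily hand each additional cache slot to the head with the largest marginal gain, which is optimal when the curves are concave.

```python
import heapq
import numpy as np

def allocate_budget(utils: np.ndarray, total: int) -> np.ndarray:
    """utils[h, b] = profiled utility of head h at budget b; returns per-head budgets."""
    n_heads, max_b = utils.shape
    alloc = np.zeros(n_heads, dtype=int)
    heap = [(-(utils[h, 1] - utils[h, 0]), h) for h in range(n_heads)]
    heapq.heapify(heap)                      # max-heap over marginal gains
    for _ in range(total):
        _, h = heapq.heappop(heap)           # head with the largest marginal utility
        alloc[h] += 1
        if alloc[h] + 1 < max_b:
            gain = utils[h, alloc[h] + 1] - utils[h, alloc[h]]
            heapq.heappush(heap, (-gain, h))
    return alloc

rng = np.random.default_rng(0)
# Synthetic concave utility curves with diminishing returns per head.
curves = np.cumsum(rng.random((4, 64)) / np.arange(1, 65), axis=1)
print(allocate_budget(curves, total=48))     # per-head KV budgets summing to 48
```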
- KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity [6.542188603141656]
The Key-Value cache is central to the efficiency of large language models. As sequence length and batch size grow, the cache becomes a major memory bottleneck. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix.
arXiv Detail & Related papers (2025-12-05T17:51:10Z)
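The fidelity trade-off of low-rank KV compression can be sketched with a plain truncated SVD of the key matrix (the paper's actual objective couples K and Q; this is only a simplified illustration): caching the rank-r projections of K reproduces the attention logits up to the energy in the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, r = 512, 64, 16
K = rng.standard_normal((seq, d))
Q = rng.standard_normal((8, d))

# Truncated SVD of K: cache only the r-dim projections K @ Vr (seq x r).
U, S, Vt = np.linalg.svd(K, full_matrices=False)
Vr = Vt[:r].T                      # d x r projection basis
K_low = K @ Vr                     # compressed cache, seq x r

logits_full = Q @ K.T
logits_low = (Q @ Vr) @ K_low.T    # queries projected into the same basis
err = np.linalg.norm(logits_full - logits_low) / np.linalg.norm(logits_full)
print(f"relative logit error at rank {r}/{d}: {err:.3f}")
```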
- vAttention: Verified Sparse Attention [100.98210818821688]
vAttention is a practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). We show that vAttention significantly improves the quality of sparse attention across datasets. It can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality.
arXiv Detail & Related papers (2025-10-07T08:46:08Z)
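A toy illustration of sampled verification in the spirit of vAttention (the paper's estimator and guarantee are more refined): attend exactly to a top-k set, estimate the leftover attention mass by uniform sampling, and check it against a user target $\epsilon$ at confidence $1 - \delta$ via a Hoeffding bound. For brevity the toy computes the dense softmax up front; a real system would score only the sampled entries.

```python
import numpy as np

def verified_dropped_mass(scores, topk, eps, delta, n_samples=256, seed=0):
    rng = np.random.default_rng(seed)
    p = np.exp(scores - scores.max())
    p /= p.sum()                               # dense softmax (oracle, toy only)
    rest = np.argsort(p)[:-topk]               # entries outside the top-k
    sample = rng.choice(rest, size=n_samples, replace=True)
    est = p[sample].mean() * len(rest)         # unbiased estimate of dropped mass
    # One-sided Hoeffding half-width for a mean of values in [0, p_max]:
    hw = p[rest].max() * len(rest) * np.sqrt(np.log(1 / delta) / (2 * n_samples))
    return est, bool(est + hw <= eps)          # (estimate, verified at 1 - delta?)

scores = np.random.default_rng(1).standard_normal(4096) * 3.0
est, ok = verified_dropped_mass(scores, topk=256, eps=0.05, delta=0.01)
print(f"estimated dropped mass {est:.4f}, verified: {ok}")
```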
- Value-Guided KV Compression for LLMs via Approximated CUR Decomposition [24.262712463465665]
CurDKV is a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from a CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output $\mathrm{softmax}(QK^T)V$, ensuring that the retained tokens best preserve the model's predictive behavior.
arXiv Detail & Related papers (2025-09-18T15:04:06Z)
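Leverage-score selection can be sketched as follows, assuming plain SVD-based leverage on the value matrix rather than the paper's CUR decomposition tied to $\mathrm{softmax}(QK^T)V$: the tokens with the largest rank-r leverage scores best span the matrix's dominant subspace.

```python
import numpy as np

def leverage_select(V: np.ndarray, r: int, budget: int) -> np.ndarray:
    """Keep the `budget` tokens with the largest rank-r leverage scores of V."""
    U, _, _ = np.linalg.svd(V, full_matrices=False)
    lev = (U[:, :r] ** 2).sum(axis=1)          # leverage score of each token (row)
    return np.sort(np.argsort(lev)[::-1][:budget])

rng = np.random.default_rng(0)
V = rng.standard_normal((1024, 64))            # toy value cache
keep = leverage_select(V, r=16, budget=128)
print("retained tokens:", keep[:8], "...")
```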
- Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting the distributional shift introduced by sparse attention. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M-token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z)
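A toy rendering of the general recipe (the paper's actual correction procedure differs in detail): estimate the output shift introduced by a sparse attention pattern on a small probe set of queries where dense attention is affordable, and apply it as a cheap additive correction.

```python
import numpy as np

def softmax_attn(Q, K, V, mask=None):
    s = Q @ K.T / np.sqrt(K.shape[1])
    if mask is not None:
        s = np.where(mask, s, -np.inf)         # hide masked-out entries
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d = 512, 32
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
mask = rng.random((n, n)) < 0.1                # stand-in sparse pattern
mask |= np.eye(n, dtype=bool)                  # each query keeps at least itself

out_sparse = softmax_attn(Q, K, V, mask)
probe = rng.choice(n, size=16, replace=False)  # few queries get dense attention
delta = (softmax_attn(Q[probe], K, V) - out_sparse[probe]).mean(axis=0)
out_corrected = out_sparse + delta             # shift every output by the mean delta
print("estimated mean |shift|:", float(np.abs(delta).mean()))
```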
- AttentionPredictor: Temporal Patterns Matter for KV Cache Compression [64.75459635661562]
We propose AttentionPredictor, the first learning-based method to directly predict attention patterns for KV cache compression and critical token identification. AttentionPredictor accurately predicts attention scores and shares a unified prediction model that consumes negligible memory. By retaining most of the attention information, AttentionPredictor achieves 13x KV cache compression and a 5.6x speedup in a cache-offloading scenario.
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
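A minimal sketch of the temporal-prediction idea, using a tiny least-squares model on synthetic attention traces as a stand-in for the paper's learned predictor: forecast each token's next-step attention score from a short window of its past scores and keep the predicted top tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, tokens, w = 200, 64, 3
# Synthetic smooth attention traces: each token's score follows its recent past.
A = np.abs(np.cumsum(rng.standard_normal((steps, tokens)) * 0.1, axis=0)) + 0.1
A /= A.sum(axis=1, keepdims=True)

# Regression pairs: window of w past scores -> next score, pooled over tokens.
X = np.stack([A[t - w:t].T for t in range(w, steps - 1)]).reshape(-1, w)
y = A[w:steps - 1].reshape(-1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # tiny shared linear predictor

pred = A[-1 - w:-1].T @ coef                   # forecast the held-out last step
keep = set(np.argsort(pred)[::-1][:16])        # predicted critical tokens
true = set(np.argsort(A[-1])[::-1][:16])
print("overlap with true top-16:", len(keep & true))
```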
- Predicting Overtakes in Trucks Using CAN Data [51.28632782308621]
We investigate the detection of truck overtakes from CAN data.
Our analysis covers up to 10 seconds before the overtaking event.
We observe that the prediction scores of the overtake class tend to increase as we approach the overtake trigger.
arXiv Detail & Related papers (2024-04-08T17:58:22Z)
- Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking [57.42241521034744]
We propose the concept of certified error control of candidate set pruning for relevance ranking.
Our method successfully prunes the first-stage retrieved candidate sets to improve the second-stage reranking speed.
arXiv Detail & Related papers (2022-05-19T16:00:13Z)
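The certification idea above can be sketched conformal-style on synthetic data: calibrate a first-stage score cutoff so the empirical rate of pruning away the relevant candidate stays below a target alpha. Calibration and evaluation share data here for brevity; the paper uses a proper held-out split and a formal guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_cands, alpha = 500, 100, 0.05
scores = rng.random((n_queries, n_cands))                    # first-stage scores
rel = np.argmax(scores + rng.normal(0, 0.3, scores.shape), axis=1)  # noisy relevant item

# Calibrate the cutoff so at most ~alpha of relevant items fall below it.
rel_scores = scores[np.arange(n_queries), rel]
tau = np.quantile(rel_scores, alpha)

kept = (scores >= tau).mean()                                # surviving candidate fraction
miss = (rel_scores < tau).mean()                             # empirical miss rate
print(f"cutoff {tau:.3f}: keep {kept:.1%} of candidates, miss rate {miss:.3f} <= {alpha}")
```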
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.