QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill
- URL: http://arxiv.org/abs/2602.08722v1
- Date: Mon, 09 Feb 2026 14:32:26 GMT
- Title: QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill
- Authors: Dalton Jones, Junyoung Park, Matthew Morse, Mingu Lee, Chris Lott, Harper Langston,
- Abstract summary: We present QUOKA: Query-oriented KV selection for efficient attention. We show that QUOKA achieves near-baseline accuracy, utilizing 88% fewer key-value pairs per attention evaluation.
- Score: 5.014026212750645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present QUOKA: Query-oriented KV selection for efficient attention, a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While many queries focus on only a small group of keys in the attention operator, we observe that queries with low cosine similarity to the mean query interact more strongly with more keys and contribute the most to the final attention logits. By prioritizing these low-cosine-similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QUOKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselecting the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that QUOKA achieves near-baseline accuracy while utilizing 88% fewer key-value pairs per attention evaluation, realizing a 3x reduction in time-to-first-token, a 5x speedup in attention on Nvidia GPUs, and up to nearly a 7x speedup on Intel Xeon CPUs.
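The abstract outlines a two-stage selection but gives no pseudocode. Below is a minimal PyTorch sketch of the stated idea: keep the queries least aligned (by cosine similarity) with the chunk's mean query, score keys by their alignment with that reduced query set, and attend only over the retained keys. The function name, the keep fractions, and the max-over-queries key score are illustrative assumptions, not QUOKA's exact procedure.

```python
import torch

def quoka_select(q, k, v, query_keep=0.25, kv_keep=0.12):
    """Sketch of query-oriented KV selection for one attention head.

    q: (Lq, d) chunk queries; k, v: (Lk, d) keys/values seen so far.
    query_keep / kv_keep are illustrative budgets (kv_keep ~= 0.12 echoes
    the reported ~88% reduction in KV pairs per attention evaluation).
    Causal masking is omitted for brevity.
    """
    # 1) Rank queries by cosine similarity to the mean query; keep the
    #    LOW-similarity ones, which the paper observes interact with more
    #    keys and contribute the most to the attention logits.
    q_mean = q.mean(dim=0, keepdim=True)                    # (1, d)
    cos = torch.nn.functional.cosine_similarity(q, q_mean)  # (Lq,)
    n_q = max(1, int(query_keep * q.shape[0]))
    rep_idx = torch.topk(-cos, n_q).indices                 # lowest cosine first
    q_rep = q[rep_idx]                                      # (n_q, d)

    # 2) Score every key by its strongest alignment with any representative
    #    query (max-pooling over representative queries is an assumption),
    #    then keep only the top keys.
    key_scores = (q_rep @ k.T).amax(dim=0)                  # (Lk,)
    n_k = max(1, int(kv_keep * k.shape[0]))
    kv_idx = torch.topk(key_scores, n_k).indices

    # 3) Ordinary softmax attention for ALL queries, but only over the
    #    selected keys/values.
    k_sel, v_sel = k[kv_idx], v[kv_idx]
    attn = torch.softmax(q @ k_sel.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_sel
```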
Related papers
- Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference [54.467557491325046]
We propose Pre-hoc Sparsity (PrHS), which selects KV entries before attention scoring and provides explicit accuracy control. PrHS reduces retrieval overhead by over 90%, achieving 3x higher retrieval sparsity than HShare at matched or better accuracy. It incurs under 1% average degradation on LongBench, lowers attention FLOPs by about 15% versus prior baselines, and yields a 9.9x speedup in attention-operator latency and 2.8x higher throughput.
arXiv Detail & Related papers (2026-02-09T07:05:23Z)
- Efficient Low Rank Attention for Long-Context Inference in Large Language Models [41.24530756499533]
Low Rank Query and Key attention (LRQK) is a framework that decomposes the full-precision query and key matrices into compact rank-$r$ factors during the prefill stage. By selecting only the top-$k$ tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs.
arXiv Detail & Related papers (2025-10-25T11:43:27Z)
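The summary above describes LRQK's mechanism at a high level: score tokens cheaply with rank-$r$ factors of the query and key matrices, then fetch full-precision KV pairs only for the selected tokens. Below is a hedged illustration of that scoring path; the truncated-SVD factorization, the max-pooling over queries, and the fixed recent-token window size are assumptions made for this sketch, not details from the paper.

```python
import torch

def lrqk_scores(Q, K, r=16):
    """Approximate attention logits from rank-r factors of Q and K.
    Truncated SVD is one way to obtain such factors (an assumption here)."""
    Uq, Sq, Vq = torch.svd_lowrank(Q, q=r)   # Q ~= Uq diag(Sq) Vq^T
    Uk, Sk, Vk = torch.svd_lowrank(K, q=r)   # K ~= Uk diag(Sk) Vk^T
    Q_r = Uq * Sq                            # (Lq, r) compact query factor
    K_r = Uk * Sk                            # (Lk, r) compact key factor
    # Align the two rank-r bases before multiplying: Q K^T ~= Q_r (Vq^T Vk) K_r^T
    return Q_r @ (Vq.T @ Vk) @ K_r.T         # (Lq, Lk) approximate logits

def select_tokens(approx_logits, top_k=64, recent=32):
    """Top-k tokens under the approximate scores, plus a recent-token window.
    Only these indices would need full-precision KV fetched on a cache miss."""
    Lk = approx_logits.shape[1]
    scores = approx_logits.amax(dim=0)               # pool over queries
    keep = set(torch.topk(scores, min(top_k, Lk)).indices.tolist())
    keep.update(range(max(0, Lk - recent), Lk))      # always keep recent tokens
    return sorted(keep)
```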
arXiv Detail & Related papers (2025-10-25T11:43:27Z) - vAttention: Verified Sparse Attention [100.98210818821688]
vAttention is a practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). We show that vAttention significantly improves the quality of sparse attention across datasets. It can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality.
arXiv Detail & Related papers (2025-10-07T08:46:08Z)
arXiv Detail & Related papers (2025-10-07T08:46:08Z) - Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
arXiv Detail & Related papers (2025-06-16T03:00:40Z) - Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
Transformer attention has quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z)
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - AttentionPredictor: Temporal Patterns Matter for KV Cache Compression [64.75459635661562]
We propose AttentionPredictor, which is the first learning-based method to directly predict attention patterns for KV cache compression and critical token identification. AttentionPredictor accurately predicts the attention score and shares the unified prediction model, which consumes negligible memory. By retaining most of the attention information, AttentionPredictor achieves 13$\times$ KV cache compression and 5.6$\times$ speedup in a cache offloading scenario.
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
arXiv Detail & Related papers (2025-02-06T13:41:46Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
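The centroid-lookup step described in this entry can be sketched in a few lines: cluster the fixed context's keys offline, then at inference score each query against the centroids and keep only the keys belonging to the best-matching clusters. The tiny Lloyd's k-means, the dot-product cluster score, and the number of clusters retained are illustrative assumptions, not Squeezed Attention's actual implementation.

```python
import torch

def kmeans_centroids(keys, n_clusters=32, iters=10):
    """Offline step: a minimal Lloyd's k-means over the fixed context's keys."""
    centroids = keys[torch.randperm(keys.shape[0])[:n_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(keys, centroids).argmin(dim=1)   # (Lk,) cluster id per key
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids, assign

def lookup_relevant_keys(query, centroids, assign, n_clusters_keep=4):
    """Online step: score centroids against the query and keep only the keys
    from the best-matching clusters (the cluster budget is an assumed knob)."""
    cluster_scores = centroids @ query                         # (n_clusters,)
    best = torch.topk(cluster_scores, n_clusters_keep).indices
    mask = torch.isin(assign, best)
    return mask.nonzero(as_tuple=True)[0]                      # indices of keys to attend to
```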
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs [8.649971923487835]
We propose CritiPrefill, a criticality-based segment-wise prefilling method for long-context processing.
CritiPrefill partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality.
Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU.
arXiv Detail & Related papers (2024-09-19T06:09:56Z)
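The segment-and-block partitioning described in this entry lends itself to a short sketch: split the queries into segments and the KV cache into blocks, estimate one criticality score per (segment, block) pair, and let each segment attend only to its top blocks. Mean-pooling each segment and block into a single vector is an assumption made here; the summary does not specify CritiPrefill's actual criticality estimator.

```python
import torch

def segment_block_criticality(q, k, q_seg=128, k_blk=128):
    """Sketch of segment-wise criticality estimation.

    Queries are split into segments and keys into blocks; each
    (segment, block) pair gets one criticality score. Mean-pooling is an
    assumption, not necessarily CritiPrefill's estimator."""
    q_segs = q.split(q_seg)                                   # list of (<=q_seg, d)
    k_blks = k.split(k_blk)                                   # list of (<=k_blk, d)
    q_pool = torch.stack([s.mean(dim=0) for s in q_segs])     # (S, d)
    k_pool = torch.stack([b.mean(dim=0) for b in k_blks])     # (B, d)
    return q_pool @ k_pool.T                                  # (S, B) criticality estimates

def blocks_for_segment(crit, seg_idx, keep=8):
    """Pick the KV blocks a given query segment should attend to."""
    return torch.topk(crit[seg_idx], min(keep, crit.shape[1])).indices
```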
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [24.472784635757016]
RetrievalAttention is a training-free approach to both accelerate attention computation and reduce GPU memory consumption. We show that RetrievalAttention achieves near full attention accuracy while only requiring access to 1--3% of the data.
arXiv Detail & Related papers (2024-09-16T17:59:52Z)
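Since the entry above frames attention as a vector-retrieval problem, a small sketch of that framing follows: index the keys in an approximate nearest-neighbour structure, retrieve only the keys most aligned with a query, and run softmax attention over that subset. The use of FAISS, the IVF index, and the nprobe/top-k settings are illustrative choices, not details taken from the paper.

```python
import numpy as np
import faiss  # any approximate nearest-neighbour library would do; FAISS is an assumption

def build_key_index(keys, nlist=100):
    """Offline: put the keys into an inverted-file (IVF) index so that a
    query only scans a small fraction of them at search time."""
    keys = np.ascontiguousarray(keys, dtype=np.float32)
    d = keys.shape[1]
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(keys)
    index.add(keys)
    index.nprobe = 2          # scan ~2% of the lists, echoing the 1--3% access claim
    return index

def retrieval_attention(query, index, keys, values, top_k=64):
    """Online: retrieve the keys most aligned with the query, then run
    ordinary softmax attention over that small subset only."""
    q = np.ascontiguousarray(query[None, :], dtype=np.float32)
    _, idx = index.search(q, top_k)
    idx = idx[0][idx[0] >= 0]                                  # drop padding ids (-1)
    logits = keys[idx] @ query / np.sqrt(keys.shape[1])
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ values[idx]
```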
This list is automatically generated from the titles and abstracts of the papers in this site.