Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees
- URL: http://arxiv.org/abs/2602.13804v1
- Date: Sat, 14 Feb 2026 14:29:10 GMT
- Title: Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees
- Authors: Vashista Nobaub
- Abstract summary: Large language models spend most of their inference cost on attention over long contexts. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors. We introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap $\Delta$ certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as $\exp(-\Omega(\Delta/\varepsilon))$, while the error on the active face scales linearly in the temperature/regularization parameter $\varepsilon$. This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob to trade accuracy for compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query through a paging-style context selection strategy compatible with modern inference stacks. Across long-context evaluations, we observe stable constant-size effective support, strong wall-clock speedups, and minimal quality degradation in the regimes predicted by the support-gap diagnostics. Finally, we discuss deployment implications for privacy-sensitive and air-gapped settings, where interchangeable attention modules enable predictable latency and cost without external retrieval dependencies.
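No reference implementation accompanies this listing, so the following is a minimal NumPy sketch of the two ingredients the abstract describes: entropic attention restricted to a small candidate set, and the support-gap diagnostic $\Delta$ that indicates when truncation is safe. The function names, the top-$k$ candidate selection, and the temperature parameter `eps` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def support_gap(scores, k):
    """Gap Delta between the k-th and (k+1)-th largest attention logits.

    Per the paper's face-stability theorem, a large gap means softmax mass
    outside the top-k "active face" decays like exp(-Omega(Delta / eps))."""
    s = np.sort(scores)[::-1]
    return float(s[k - 1] - s[k])

def sparse_attention(q, K, V, k=16, eps=1.0):
    """Entropic attention restricted to a top-k candidate set.
    q: (d,) query; K: (n, d) keys; V: (n, d_v) values."""
    scores = K @ q                            # attention logits, shape (n,)
    gap = support_gap(scores, k)              # support-gap diagnostic
    idx = np.argpartition(scores, -k)[-k:]    # candidate set (active face)
    w = np.exp((scores[idx] - scores[idx].max()) / eps)
    w /= w.sum()                              # softmax over candidates only
    return w @ V[idx], gap

rng = np.random.default_rng(0)
n, d = 4096, 64
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out, gap = sparse_attention(q, K, V, k=16, eps=1.0)
print(f"support gap = {gap:.3f}; predicted dropped mass ~ {n * np.exp(-gap):.2e}")
```

Note that the full top-$k$ scan above still costs $O(n)$ per query; per the abstract, the actual mechanism maintains the candidate set incrementally through a paging-style selection strategy, which is not reproduced here.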
Related papers
- Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
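The summary names two ideas, position alignment and keyword scoping, without giving their exact rules; as a hypothetical illustration only, the sketch below builds a boolean attention mask that admits one position-aligned condition token plus a small keyword set per query. The alignment formula and the mask shape are assumptions.

```python
import numpy as np

def pka_mask(n_img, n_cond, keyword_idx):
    """Hypothetical PKA-style mask: each image query attends to its
    position-aligned condition token plus a global keyword set."""
    mask = np.zeros((n_img, n_cond), dtype=bool)
    aligned = (np.arange(n_img) * n_cond) // n_img   # naive position alignment
    mask[np.arange(n_img), aligned] = True
    mask[:, keyword_idx] = True                      # keyword scoping
    return mask

print(pka_mask(8, 16, keyword_idx=[0, 5]).sum(axis=1))  # per-query budget
```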
arXiv Detail & Related papers (2026-02-06T16:39:10Z)
- State Rank Dynamics in Linear Attention LLMs [37.607046806053035]
State Rank Stratification is characterized by a distinct spectral bifurcation among linear attention heads. Low-rank heads are indispensable for model reasoning, whereas high-rank heads exhibit significant redundancy. We propose Joint Rank-Norm Pruning, a zero-shot strategy that achieves a 38.9% reduction in KV-cache overhead while largely maintaining model accuracy.
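The joint rank-norm criterion is not spelled out in this snippet, so the sketch below is only a guess at its general shape: score each linear-attention head's state matrix by numerical rank and norm, then keep low-rank heads (reported as indispensable) while dropping the rest. The scoring pairing and the thresholds are assumptions.

```python
import numpy as np

def head_scores(states, tol=1e-3):
    """Per-head (rank, norm) for a list of linear-attention state matrices.
    Pairing rank with norm is an assumed reading of "joint rank-norm"."""
    return [(np.linalg.matrix_rank(S, tol=tol), np.linalg.norm(S))
            for S in states]

def prune_heads(states, max_rank=8, min_norm=1.0):
    """Keep low-rank heads; drop high-rank heads unless their norm is large.
    Thresholds are illustrative, not from the paper."""
    return [i for i, (r, nrm) in enumerate(head_scores(states))
            if r <= max_rank or nrm >= min_norm]
```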
arXiv Detail & Related papers (2026-02-02T15:00:42Z)
- Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
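The dual-branch aggregation itself is not reproduced here; the sketch below only illustrates the anchor idea, a sparse causal mask in which every query sees a local window plus all preceding punctuation tokens. The punctuation set and window size are assumptions.

```python
import numpy as np

PUNCT = {".", ",", "!", "?", ";", ":"}   # assumed anchor set

def punctuation_anchor_mask(tokens, window=4):
    """Sparse causal mask: local window plus past punctuation anchors."""
    n = len(tokens)
    anchors = [i for i, t in enumerate(tokens) if t in PUNCT]
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):i + 1] = True          # causal local window
        mask[i, [a for a in anchors if a <= i]] = True    # boundary anchors
    return mask

toks = "long context , compressed by anchors .".split()
print(punctuation_anchor_mask(toks, window=2).sum(axis=1))
```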
arXiv Detail & Related papers (2026-01-06T08:47:16Z)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a sparse attention mechanism that selectively attends only to the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8$\times$ speedup and reduces the KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
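TCA-Attention's actual selection rule is not described in this snippet; as a generic stand-in for training-free selection of informative tokens, the sketch below keeps a fixed fraction of the cached tokens whose keys score highest against the current query. The budget and the dot-product score are assumptions.

```python
import numpy as np

def select_informative(q, K, V, budget=0.4):
    """Training-free pruning: keep the `budget` fraction of cached tokens
    whose keys score highest against the current query q."""
    k = max(1, int(budget * len(K)))
    idx = np.argpartition(K @ q, -k)[-k:]
    return K[idx], V[idx]              # reduced KV cache for this step
```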
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
- Priors in Time: Missing Inductive Biases for Language Model Interpretability [58.07412640266836]
We show that Sparse Autoencoders impose priors that assume independence of concepts across time, implying stationarity. We introduce a new interpretability objective, Temporal Feature Analysis, which possesses a temporal inductive bias to decompose representations at a given time into two parts. Our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
arXiv Detail & Related papers (2025-11-03T18:43:48Z)
- Does More Inference-Time Compute Really Help Robustness? [50.47666612618054]
We show that small-scale, open-source models can benefit from inference-time scaling. We also identify an important security risk, intuitively motivated and empirically verified: an inverse scaling law. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
arXiv Detail & Related papers (2025-07-21T18:08:38Z)
- Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z)
- Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models [0.0]
We show that errors are not uniformly distributed but are concentrated at sparse "key tokens" representing critical decision junctions. We propose a framework for next-generation systems centered on selective preservation of semantically vital tokens.
arXiv Detail & Related papers (2025-05-30T03:57:31Z)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, activation-aware approach that dynamically determines a probe-Query and leverages it to retrieve the relevant KV pairs for inference. Experiments on the Long-Bench and $\infty$-Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
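How ActQKV constructs its probe-Query is not given here; a hypothetical activation-aware variant might average recent queries weighted by their activation magnitude and use the result to fetch the top-k KV pairs, as sketched below. The norm-based weighting is an assumption, not the paper's rule.

```python
import numpy as np

def probe_query(recent_queries):
    """Hypothetical activation-aware probe: average recent queries,
    weighting each by its L2 norm as a crude activation proxy."""
    w = np.linalg.norm(recent_queries, axis=1)
    return (w[:, None] * recent_queries).sum(axis=0) / w.sum()

def retrieve_kv(probe, K, V, k=256):
    """Fetch the k cached KV pairs most aligned with the probe query."""
    idx = np.argpartition(K @ probe, -k)[-k:]
    return K[idx], V[idx]
```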
arXiv Detail & Related papers (2025-02-19T08:50:44Z)
- AttentionPredictor: Temporal Patterns Matter for KV Cache Compression [64.75459635661562]
We propose AttentionPredictor, which is the first learning-based method to directly predict attention patterns for KV cache compression and critical token identification. AttentionPredictor accurately predicts the attention score and shares the unified prediction model, which consumes negligible memory. By retaining most of the attention information, AttentionPredictor achieves 13$\times$ KV cache compression and 5.6$\times$ speedup in a cache offloading scenario.
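AttentionPredictor's learned model is not reproduced here; as a minimal temporal baseline in the same spirit, the sketch below tracks an exponential moving average of each cached entry's attention mass and keeps only the entries predicted to stay important. The EMA rule and the fixed budget are assumptions, not the paper's method.

```python
import numpy as np

class EMAAttentionTracker:
    """Running estimate of each KV entry's attention mass over time."""
    def __init__(self, n, decay=0.9):
        self.score = np.zeros(n)
        self.decay = decay

    def update(self, attn_weights):
        """attn_weights: this step's softmax weights over the n cached tokens."""
        self.score = self.decay * self.score + (1 - self.decay) * attn_weights

    def keep_indices(self, budget):
        """Indices of the `budget` entries predicted to stay important."""
        return np.argpartition(self.score, -budget)[-budget:]
```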
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
- How Smooth Is Attention? [26.322030088685928]
We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios.
We show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor.
Our mean-field framework for masked self-attention is novel and of independent interest.
arXiv Detail & Related papers (2023-12-22T16:47:10Z)