vAttention: Verified Sparse Attention
- URL: http://arxiv.org/abs/2510.05688v1
- Date: Tue, 07 Oct 2025 08:46:08 GMT
- Title: vAttention: Verified Sparse Attention
- Authors: Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
- Abstract summary: vAttention is a practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). We show that vAttention significantly improves the quality of sparse attention across datasets. It can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality.
- Score: 100.98210818821688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
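The abstract's core idea, combining exact top-$k$ contributions with an unbiased sampling estimate of the remaining softmax mass, can be sketched as follows. This is a minimal single-query illustration under assumptions, not the paper's implementation: the function names `hybrid_attention` and `hoeffding_samples`, and the use of a Hoeffding-style bound to pick the sample size for a $(\epsilon, \delta)$-type guarantee, are hypothetical choices for exposition (the paper's actual estimator and bound may differ).

```python
import numpy as np

def hoeffding_samples(eps, delta, value_range=1.0):
    # Sample size so the mean of bounded terms is within eps of its
    # expectation with probability >= 1 - delta (Hoeffding's inequality).
    return int(np.ceil((value_range**2 / (2 * eps**2)) * np.log(2.0 / delta)))

def hybrid_attention(q, K, V, k=8, eps=0.05, delta=0.05, rng=None):
    # Hybrid top-k + sampling estimate of softmax attention for one query.
    # Top-k handles peaked score distributions exactly; uniform sampling
    # gives an unbiased estimate over the near-uniform tail.
    rng = np.random.default_rng(rng)
    n, d = K.shape
    scores = K @ q / np.sqrt(d)       # full scores shown only for clarity;
                                      # a real system approximates these
    scores = scores - scores.max()    # shift for numerical stability
    top = np.argsort(scores)[-k:]     # indices kept exactly
    tail = np.setdiff1d(np.arange(n), top)

    w_top = np.exp(scores[top])
    num = w_top @ V[top]              # exact numerator over the top-k set
    den = w_top.sum()

    if tail.size:
        m = min(tail.size, hoeffding_samples(eps, delta))
        idx = rng.choice(tail, size=m, replace=True)
        w = np.exp(scores[idx])
        # scale the sample means up to the whole tail (unbiased estimates)
        den += tail.size * w.mean()
        num += tail.size * (w[:, None] * V[idx]).mean(axis=0)
    return num / den
```

When scores are peaked, the exact top-$k$ term dominates and the sampled correction is small; when scores are flat, the uniform sample estimates the tail mass well, which is exactly the complementarity the abstract describes.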
Related papers
- Towards Anytime-Valid Statistical Watermarking [63.02116925616554]
We develop the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-19T18:32:26Z)
- Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference [15.466168180222164]
We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-10-21T08:44:47Z)
- Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models [16.540220733551823]
Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads. We propose a training-free framework built on a simple intuition.
arXiv Detail & Related papers (2025-09-29T14:20:05Z)
- Faster Diffusion Models via Higher-Order Approximation [28.824924809206255]
We propose a principled, training-free sampling algorithm that requires only on the order of $d^{1+2/K}\varepsilon^{-1/K}$ score function evaluations. Our theory is robust vis-a-vis inexact score estimation, degrading gracefully as the score estimation error increases. More broadly, our work develops a theoretical framework towards understanding the efficacy of high-order methods for accelerated sampling.
arXiv Detail & Related papers (2025-06-30T16:49:03Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Robust Conformal Prediction with a Single Binary Certificate [58.450154976190795]
Conformal prediction (CP) converts any model's output to prediction sets with a guarantee to cover the true label with (adjustable) high probability. We propose a robust conformal prediction method that produces smaller sets even with significantly fewer MC samples.
arXiv Detail & Related papers (2025-03-07T08:41:53Z) - Robust Representation Consistency Model via Contrastive Denoising [83.47584074390842]
Randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples. We reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space.
arXiv Detail & Related papers (2025-01-22T18:52:06Z) - Statistical Significance of Feature Importance Rankings [3.8642937395065124]
We devise techniques that ensure the most important features are correct with high-probability guarantees. These assess the set of $K$ top-ranked features, as well as the order of its elements. We then introduce two efficient sampling algorithms that identify the $K$ most important features, perhaps in order, with probability exceeding $1-\alpha$.
arXiv Detail & Related papers (2024-01-28T23:14:51Z) - Distance Matters For Improving Performance Estimation Under Covariate
Shift [18.68533487971233]
Under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution.
We show that taking into account the distances of test samples to their expected training distribution can significantly improve performance estimation.
We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide range of natural and synthetic distribution shifts.
arXiv Detail & Related papers (2023-08-14T15:49:19Z)
- Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.