Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
- URL: http://arxiv.org/abs/2505.11254v1
- Date: Fri, 16 May 2025 13:48:33 GMT
- Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang
- Abstract summary: A transformer has quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method maintains approximately 98.5% sparsity relative to full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M-token prefills.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention mechanism of a transformer has quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method and yields an average 36-percentage-point performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens, while adding only a small overhead. Our method maintains approximately 98.5% sparsity relative to full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M-token prefills.
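To make the failure mode concrete, the toy sketch below (plain NumPy, a single head; all names are illustrative and this is not the paper's implementation) compares full causal attention against sliding window attention with sink tokens, the sparse pattern behind the abstract's headline result. The mean output gap it prints is the kind of distributional shift the proposed delta correction is designed to close:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask):
    # scaled dot-product attention for a single head; masked entries get -inf
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(np.where(mask, scores, -np.inf)) @ v

def sliding_window_sink_mask(n, window=4, sinks=2):
    # causal mask restricted to `sinks` leading tokens plus a local window
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (((i - j) < window) | (j < sinks))

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

full = attention(q, k, v, np.tril(np.ones((n, n), bool)))
sparse = attention(q, k, v, sliding_window_sink_mask(n))
# the gap below is the distributional shift that delta correction targets
print(np.abs(full - sparse).mean())
```

Early rows match exactly (their causal context fits inside the window), while later rows drift away from the quadratic-attention outputs.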
Related papers
- DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training [22.898073682504023]
In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction. We formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG). We present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies.
arXiv Detail & Related papers (2026-01-29T15:10:13Z) - SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer [58.79642223409644]
Diffusion Transformers have recently demonstrated remarkable performance in video generation. We propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Our method attains 90% sparsity and a 1.72x inference speedup, while maintaining generation quality comparable to the full-attention baseline.
arXiv Detail & Related papers (2026-01-23T07:28:53Z) - SpecAttn: Speculating Sparse Attention [1.6921396880325779]
We introduce SpecAttn, a novel training-free approach that seamlessly integrates with speculative decoding techniques. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model. SpecAttn achieves over a 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset.
arXiv Detail & Related papers (2025-10-31T17:12:34Z) - DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning [6.468843780300177]
We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
arXiv Detail & Related papers (2025-10-10T21:37:49Z) - SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [88.47701139980636]
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. We propose SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
arXiv Detail & Related papers (2025-09-28T17:58:59Z) - Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning [12.808478519221577]
We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks. LessIsMore aggregates token selections from local attention heads with recent contextual information. It achieves a 1.13x end-to-end speedup compared to existing sparse attention methods.
arXiv Detail & Related papers (2025-08-09T21:10:33Z) - Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by computing exact attention only for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z) - SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling [24.241825495462397]
Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. We propose SALE, a fine-grained attention method that accelerates the long-context prefilling stage of LLMs with negligible loss in model accuracy. SALE achieves at least 3.36x speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.
arXiv Detail & Related papers (2025-05-30T03:40:24Z) - Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation [57.56385490252605]
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. We propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste.
arXiv Detail & Related papers (2025-05-24T21:30:29Z) - FlashBias: Fast Computation of Attention with Bias [70.44379606190569]
Attention with bias has been widely deployed in vision, language, protein-folding, and other advanced scientific models. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention. This paper presents FlashBias, based on low-rank compressed sensing theory, which can provide fast, exact computation for many widely used attention biases.
arXiv Detail & Related papers (2025-05-17T15:12:50Z) - ZipR1: Reinforcing Token Sparsity in MLLMs [25.92720050123066]
We propose a simple RL-based post-training method named ZipR1 that treats the token reduction ratio as the efficiency reward and answer accuracy as the performance reward. Experimental results demonstrate that ZipR1 can reduce the token ratio of Qwen2/2.5-VL from 80% to 25% with a minimal accuracy reduction on 13 image and video benchmarks.
arXiv Detail & Related papers (2025-04-23T01:45:55Z) - Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis [15.71443217369106]
We develop a low-precision, mathematically equivalent algorithm called PASA, based on Flash Attention. PASA introduces two novel techniques: online pseudo-average shifting and global recovering. We find that the large bias and amplitude of attention input data are critical factors contributing to numerical overflow.
arXiv Detail & Related papers (2025-02-26T01:00:46Z) - Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs [10.52833484759311]
We propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism. It dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to a 7.29x decode attention speedup.
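The cumulative-score selection idea can be sketched in a few lines (plain NumPy; the function name, arguments, and fixed coverage target are illustrative, not Tactic's actual API): for one query, keep the smallest token set whose softmax weights reach a coverage target instead of a fixed top-k budget.

```python
import numpy as np

def cumulative_score_select(scores, coverage=0.95):
    """Select the smallest token set whose softmax weights sum to `coverage`.

    `scores` are pre-softmax attention logits for a single query; names and
    the coverage value are illustrative only.
    """
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # highest-weight tokens first
    csum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(csum, coverage) + 1]
    return np.sort(keep)                       # token indices to attend to

logits = np.array([4.0, 0.1, 3.5, -1.0, 0.2])
print(cumulative_score_select(logits, coverage=0.9))
```

Because the budget adapts to the score distribution, a peaked distribution keeps very few tokens while a flat one keeps many, which is the contrast with fixed-top-k selection that the summary draws.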
arXiv Detail & Related papers (2025-02-17T08:39:43Z) - Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding [1.6112718683989882]
We introduce Top-Theta Attention, or simply Top-θ, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy. Unlike top-k attention, Top-θ eliminates full-vector dependency, making it suitable for tiling and scale-out, and avoids costly top-k search.
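Threshold pruning of this kind can be illustrated with a minimal sketch (plain NumPy; the names and threshold value are illustrative, and the paper's per-head threshold calibration and compensation terms are omitted). Unlike top-k, the decision for each element depends only on that element and the threshold, not on a sort over the whole vector:

```python
import numpy as np

def top_theta_prune(scores, theta):
    """Drop attention logits below a calibrated threshold `theta`,
    then renormalize the surviving weights (illustrative sketch)."""
    keep = scores >= theta                      # elementwise, no top-k search
    shifted = np.where(keep, scores - scores.max(), -np.inf)
    w = np.exp(shifted)                         # pruned entries become 0
    return w / w.sum()

logits = np.array([3.0, -2.0, 2.5, 0.0])
print(top_theta_prune(logits, theta=1.0))  # only the two large logits survive
```

The elementwise test is what makes the scheme tiling-friendly: each tile of the score matrix can be pruned independently without a global top-k pass.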
arXiv Detail & Related papers (2025-02-12T12:50:15Z) - AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, the first learning-based critical token identification approach. AttentionPredictor accurately predicts the attention score while consuming negligible memory. We also propose a cross-token critical cache prefetching framework that hides the token time overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
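The two stages described above can be sketched as follows (plain NumPy; a bare Lloyd's k-means plus a simple top-cluster lookup, with all names and parameters illustrative rather than the paper's implementation):

```python
import numpy as np

def kmeans_centroids(keys, k, iters=10, seed=0):
    """Offline stage: cluster fixed-context keys with plain Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), k, replace=False)]
    for _ in range(iters):
        # assign each key to its nearest centroid, then recompute means
        dists = np.linalg.norm(keys[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = keys[labels == c].mean(axis=0)
    return centroids, labels

def select_important_keys(q, centroids, labels, top=1):
    """Online stage: score centroids against the query and keep the keys
    of the `top` highest-scoring clusters (parameters illustrative)."""
    chosen = np.argsort(centroids @ q)[::-1][:top]
    return np.flatnonzero(np.isin(labels, chosen))

rng = np.random.default_rng(1)
keys = rng.standard_normal((64, 8))
centroids, labels = kmeans_centroids(keys, k=4)
q = rng.standard_normal(8)
idx = select_important_keys(q, centroids, labels, top=2)
# exact attention is then computed over keys[idx] only
```

Comparing the query against a handful of centroids instead of every fixed-context key is where the bandwidth saving comes from; the exact attention over the selected keys preserves accuracy.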
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation [70.87670058323239]
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse phenomena by introducing document-level context information.
One of the most important directions is to input the whole document directly to the standard Transformer model.
In this work, we preserve translation performance while gaining a 20% speedup by introducing an extra selection layer based on lightweight attention that selects a small portion of tokens to be attended to.
arXiv Detail & Related papers (2023-09-25T14:33:47Z) - RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time consuming because sparse graph-based operations are hard to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
arXiv Detail & Related papers (2022-10-19T17:25:33Z) - SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning [10.981433334942476]
We present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access.
Experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0x with no accuracy loss, and achieves 1.6x, 3.0x, 162x, and 347x speedups, and 1.4x, 3.2x, 1193x, and 4059x energy savings.
arXiv Detail & Related papers (2020-12-17T18:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.