SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
- URL: http://arxiv.org/abs/2509.24006v1
- Date: Sun, 28 Sep 2025 17:58:59 GMT
- Title: SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
- Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
- Abstract summary: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. We propose SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
- Score: 88.47701139980636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
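To make the decomposition concrete, here is a minimal NumPy sketch of the idea (not the paper's fused GPU kernel): it materializes the full attention matrix only to classify weights, so it illustrates the critical/marginal/negligible split rather than the speedup. The thresholds, the elu+1 feature map, and the function name are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sla_attention(Q, K, V, crit_frac=0.05, neg_frac=0.5):
    """Sketch of the sparse-linear decomposition described above.

    Weights are classified by magnitude: the top crit_frac are critical
    (exact O(N^2) attention), the bottom neg_frac are negligible (skipped),
    and the rest are marginal (served by an O(N) linear approximation).
    """
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)

    # Thresholds taken from the sorted weight distribution (illustrative choice).
    flat = np.sort(P.ravel())
    crit_thr = flat[int((1.0 - crit_frac) * (flat.size - 1))]
    neg_thr = flat[int(neg_frac * (flat.size - 1))]
    critical = P >= crit_thr
    marginal = (P < crit_thr) & (P >= neg_thr)

    # Sparse part: exact weights kept only on critical entries.
    out = (P * critical) @ V

    # Linear part for the marginal mass, using a common elu+1 feature map
    # (an assumption; the paper trains its low-rank branch instead).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    kv = phi(K).T @ V                  # (d, d): computed once, O(N d^2)
    z = phi(K).sum(axis=0)             # (d,)
    lin = (phi(Q) @ kv) / (phi(Q) @ z)[:, None]
    out += (P * marginal).sum(axis=-1, keepdims=True) * lin
    return out
```

Note that with `crit_frac=1.0` and `neg_frac=0.0` every weight is critical and the sketch reduces to full exact attention, which is a handy sanity check.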
Related papers
- SLA2: Sparse-Linear Attention with Learnable Routing and QAT [86.22100800353991]
Experiments show that SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
arXiv Detail & Related papers (2026-02-13T07:16:02Z) - Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention [28.598033369607723]
Light Forcing is the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk. We also introduce a sparse attention mechanism to capture informative historical and local context in a coarse-to-fine manner.
arXiv Detail & Related papers (2026-02-04T17:41:53Z) - Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference. Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z) - SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer [58.79642223409644]
Diffusion Transformers have recently demonstrated remarkable performance in video generation. We propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Our method attains 90% sparsity and a 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline.
arXiv Detail & Related papers (2026-01-23T07:28:53Z) - Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape [23.01286982392074]
A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. Existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. We propose Re-ttention, which implements very high sparse attention for visual generation models.
arXiv Detail & Related papers (2025-05-28T22:39:12Z) - Hardware-Efficient Attention for Fast Decoding [13.958883001629644]
Grouped Latent Attention (GLA) is a parallel-friendly latent attention paired with low-level optimizations for fast decoding. Our optimized GLA kernel is up to 2x faster than FlashMLA, for example, in a speculative decoding setting.
arXiv Detail & Related papers (2025-05-27T17:54:07Z) - VORTA: Efficient Video Diffusion via Routing Sparse Attention [54.84294780326206]
VORTA is an acceleration framework with two novel components. It achieves a 1.76x end-to-end speedup without loss of quality on VBench. It can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a 14.41x speedup with negligible performance degradation.
arXiv Detail & Related papers (2025-05-24T17:46:47Z) - VSA: Faster Video Diffusion with Trainable Sparse Attention [21.593548582058403]
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference.
arXiv Detail & Related papers (2025-05-19T17:30:13Z) - Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [134.83964935755964]
In deep learning, different kinds of deep networks typically need different extrapolations, which have to be chosen after multiple trials. To relieve this issue and consistently improve the training speed of deep networks, we propose the ADAptive Nesterov momentum algorithm (Adan).
arXiv Detail & Related papers (2022-08-13T16:04:39Z) - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)
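The FlashAttention blurb above is terse; its core trick, computing softmax attention tile by tile with a running max and normalizer so the full N x N score matrix is never materialized, can be sketched in NumPy. This is an illustrative re-derivation of online softmax tiling, not the paper's CUDA kernel, and the function name and block size are assumptions.

```python
import numpy as np

def flash_attention_numpy(Q, K, V, block=64):
    """Illustrative online-softmax tiling (the core idea behind FlashAttention).

    K/V are streamed in blocks; a running row max and normalizer let the
    softmax be computed exactly without the full N x N score matrix.
    """
    N, d = Q.shape
    out = np.zeros_like(V, dtype=np.float64)
    m = np.full(N, -np.inf)              # running row max
    l = np.zeros(N)                      # running softmax normalizer
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)        # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)        # rescale previous partial sums
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vj
        m = m_new
    return out / l[:, None]
```

Because the rescaling is exact, the result matches naive softmax attention to floating-point precision; in the real kernel each tile lives in on-chip SRAM, which is where the IO savings come from.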
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.