SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
- URL: http://arxiv.org/abs/2601.16515v1
- Date: Fri, 23 Jan 2026 07:28:53 GMT
- Title: SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
- Authors: Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
- Abstract summary: Diffusion Transformers have recently demonstrated remarkable performance in video generation. We propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Our method attains 90% sparsity and 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline.
- Score: 58.79642223409644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed to address this: training-free sparse attention is constrained by limited sparsity and thus offers modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, which introduces a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and a 1.72x inference speedup while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
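The parallel sparse/linear design with an input-dependent gate described in the abstract can be pictured with a short PyTorch sketch. This is not the authors' implementation: the module name, the dense boolean mask standing in for a block-sparse kernel, the ELU+1 feature map in the linear branch, and the per-token, per-head sigmoid gate are all assumptions chosen only to make the structure concrete.

```python
# Minimal sketch (assumptions only, not SALAD's code): softmax attention
# restricted to a sparse mask, plus a kernelized linear-attention branch,
# mixed by an input-dependent gate.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseLinearGatedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Input-dependent gate: one mixing weight per token and head.
        self.gate = nn.Linear(dim, num_heads)

    def forward(self, x: torch.Tensor, sparse_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C); sparse_mask: (N, N) bool, True = query-key pair is kept.
        B, N, C = x.shape
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))

        # Sparse branch: softmax attention restricted to the mask.
        # (A real implementation would call a fused block-sparse kernel;
        # the dense mask here is only for readability.)
        out_sparse = F.scaled_dot_product_attention(q, k, v, attn_mask=sparse_mask)

        # Linear branch: O(N) by forming K^T V before touching Q.
        q_lin, k_lin = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k_lin, v)                  # (B, H, D, D)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q_lin, k_lin.sum(2)) + 1e-6)
        out_linear = torch.einsum("bhnd,bhde,bhn->bhne", q_lin, kv, z)  # (B, H, N, D)

        # Input-dependent gate balances the two branches per token and head.
        g = torch.sigmoid(self.gate(x)).transpose(1, 2).unsqueeze(-1)   # (B, H, N, 1)
        out = g * out_sparse + (1.0 - g) * out_linear
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```

In a real video DiT the sparse branch would presumably run a fused block-sparse kernel at roughly 90% sparsity; the gate and the linear branch are the kind of lightweight additions consistent with the short finetuning recipe reported in the abstract.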
Related papers
- SLA2: Sparse-Linear Attention with Learnable Routing and QAT [86.22100800353991]
Experiments show that SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
arXiv Detail & Related papers (2026-02-13T07:16:02Z) - Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention [105.11288339285154]
Efficient-LVSM is a dual-stream architecture that applies intra-view self-attention for input views and self-then-cross attention for target views. It achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed.
arXiv Detail & Related papers (2026-02-06T08:11:58Z) - Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer [13.545000689565732]
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch.
arXiv Detail & Related papers (2025-09-29T15:09:51Z) - SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [88.47701139980636]
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. We propose SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
arXiv Detail & Related papers (2025-09-28T17:58:59Z) - Bidirectional Sparse Attention for Faster Video Diffusion Training [14.523882232476092]
Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. We propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention (a toy sketch of this bidirectional selection appears after this list). BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
arXiv Detail & Related papers (2025-09-01T03:16:52Z) - Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z) - VSA: Faster Video Diffusion with Trainable Sparse Attention [38.37291040904089]
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference.
arXiv Detail & Related papers (2025-05-19T17:30:13Z) - Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
Transformers have quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting the distributional shift introduced by sparse attention. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs). In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck. We achieve a 1.76x improvement in chunk throughput, which yields a 23.50x acceleration in the prefill stage with negligible performance loss.
arXiv Detail & Related papers (2025-02-20T07:10:43Z)
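As referenced in the Bidirectional Sparse Attention entry above, the following is a minimal sketch of pruning both queries and key-value pairs before running attention. The top-k-by-norm scoring, the keep ratios, and the zero fallback for unselected query positions are illustrative assumptions, not the BSA method itself.

```python
# Toy sketch of bidirectional sparsification (not the BSA implementation):
# keep only a subset of queries and of key/value pairs, attend densely over
# the kept tokens, then scatter results back to their original positions.
import torch
import torch.nn.functional as F


def bidirectional_sparse_attention(q, k, v, q_keep: int, kv_keep: int):
    """q, k, v: (B, H, N, D). Keep q_keep queries and kv_keep key/value pairs
    per head; unselected query positions are left as zeros in this toy version."""
    B, H, N, D = q.shape

    # Score tokens by L2 norm, a stand-in for a learned or statistical criterion.
    q_idx = q.norm(dim=-1).topk(q_keep, dim=-1).indices    # (B, H, q_keep)
    kv_idx = k.norm(dim=-1).topk(kv_keep, dim=-1).indices  # (B, H, kv_keep)

    def gather_tokens(t, idx):
        return torch.gather(t, 2, idx.unsqueeze(-1).expand(-1, -1, -1, D))

    q_s = gather_tokens(q, q_idx)    # (B, H, q_keep, D)
    k_s = gather_tokens(k, kv_idx)   # (B, H, kv_keep, D)
    v_s = gather_tokens(v, kv_idx)   # (B, H, kv_keep, D)

    # Dense attention only over the kept tokens:
    # cost drops from O(N^2) to O(q_keep * kv_keep).
    out_s = F.scaled_dot_product_attention(q_s, k_s, v_s)  # (B, H, q_keep, D)

    # Scatter results back to their original query positions.
    out = torch.zeros_like(q)
    out.scatter_(2, q_idx.unsqueeze(-1).expand(-1, -1, -1, D), out_s)
    return out


# Example: keep 25% of queries and ~10% of key/value pairs on a small tensor.
q = torch.randn(1, 4, 256, 64)
k = torch.randn(1, 4, 256, 64)
v = torch.randn(1, 4, 256, 64)
out = bidirectional_sparse_attention(q, k, v, q_keep=64, kv_keep=26)
print(out.shape)  # torch.Size([1, 4, 256, 64])
```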