VSA: Faster Video Diffusion with Trainable Sparse Attention
- URL: http://arxiv.org/abs/2505.13389v4
- Date: Mon, 04 Aug 2025 04:20:16 GMT
- Title: VSA: Faster Video Diffusion with Trainable Sparse Attention
- Authors: Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang
- Abstract summary: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference.
- Score: 21.593548582058403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at \emph{both} training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight \emph{critical tokens}; a fine stage computes token-level attention only inside those tiles, subject to a block-wise compute layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85\% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53$\times$ with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6$\times$ and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
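To make the coarse-to-fine idea concrete, here is a minimal PyTorch sketch of block-sparse attention in the spirit of the abstract above: tokens are mean-pooled into tiles, a coarse stage scores tile pairs and selects the top-k critical key tiles per query tile, and a fine stage runs exact attention only inside the selected tiles. The single-head layout, pooling choice, and the `tile_size`/`topk` values are illustrative assumptions; the actual VSA is a fused, hardware-efficient GPU kernel rather than this reference loop.

```python
import torch
import torch.nn.functional as F

def vsa_like_sparse_attention(q, k, v, tile_size=64, topk=8):
    """Coarse-to-fine block-sparse attention sketch (single head).

    q, k, v: [seq_len, dim] with seq_len divisible by tile_size.
    VSA itself is a single differentiable GPU kernel; this reference
    implementation only illustrates the tile-selection logic.
    """
    seq_len, dim = q.shape
    n_tiles = seq_len // tile_size
    scale = dim ** -0.5

    # Coarse stage: mean-pool tokens into tiles and score tile pairs.
    q_tiles = q.view(n_tiles, tile_size, dim).mean(dim=1)   # [n_tiles, dim]
    k_tiles = k.view(n_tiles, tile_size, dim).mean(dim=1)   # [n_tiles, dim]
    tile_scores = q_tiles @ k_tiles.T * scale               # [n_tiles, n_tiles]

    # Keep the top-k highest-scoring key tiles for every query tile.
    topk = min(topk, n_tiles)
    critical = tile_scores.topk(topk, dim=-1).indices       # [n_tiles, topk]

    # Fine stage: exact token-level attention restricted to the critical tiles.
    out = torch.empty_like(q)
    for qt in range(n_tiles):
        q_blk = q[qt * tile_size:(qt + 1) * tile_size]      # [tile_size, dim]
        kv_idx = critical[qt]                                # selected key tiles
        k_blk = k.view(n_tiles, tile_size, dim)[kv_idx].reshape(-1, dim)
        v_blk = v.view(n_tiles, tile_size, dim)[kv_idx].reshape(-1, dim)
        attn = F.softmax(q_blk @ k_blk.T * scale, dim=-1)
        out[qt * tile_size:(qt + 1) * tile_size] = attn @ v_blk
    return out

# Example: 4096 tokens, 64-dim head; only 8 of 64 key tiles attended per query tile.
q = torch.randn(4096, 64); k = torch.randn(4096, 64); v = torch.randn(4096, 64)
print(vsa_like_sparse_attention(q, k, v).shape)  # torch.Size([4096, 64])
```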
Related papers
- VMoBA: Mixture-of-Block Attention for Video Diffusion Models [29.183614108287276]
This paper introduces Video Mixture of Block Attention (VMoBA), a novel attention mechanism specifically adapted for Video Diffusion Models (VDMs). Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, VMoBA enhances the original MoBA framework with three key modifications. Experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedups.
arXiv Detail & Related papers (2025-06-30T13:52:31Z)
- Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation [74.34633861289662]
Radial Attention is a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density. It maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention.
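As a rough illustration of "exponentially decaying compute density", the toy mask below lets two frames attend to each other through a spatial window that halves as their temporal distance grows. The frame/token layout and decay schedule are assumptions for illustration only, not the paper's actual mask.

```python
import torch

def radial_style_mask(n_frames, tokens_per_frame, base_window=16):
    """Toy boolean mask whose per-frame attention window shrinks
    exponentially with temporal distance between query and key frames."""
    n = n_frames * tokens_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)
    for qf in range(n_frames):
        for kf in range(n_frames):
            dist = abs(qf - kf)
            window = base_window // (2 ** dist)   # exponentially decaying density
            if window == 0:
                continue
            for s in range(tokens_per_frame):
                lo, hi = max(0, s - window), min(tokens_per_frame, s + window + 1)
                q_idx = qf * tokens_per_frame + s
                mask[q_idx, kf * tokens_per_frame + lo: kf * tokens_per_frame + hi] = True
    return mask

m = radial_style_mask(n_frames=8, tokens_per_frame=32)
print(m.shape, m.float().mean().item())  # fraction of attention entries retained
```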
arXiv Detail & Related papers (2025-06-24T17:59:59Z)
- RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy [10.53687668536011]
RainFusion exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. The proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models.
arXiv Detail & Related papers (2025-05-27T11:15:02Z)
- Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z)
- SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training [24.78957823032679]
We leverage the new FP4 Cores in Blackwell GPUs to accelerate attention computation. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. We also pioneer the application of low-bit attention to training tasks.
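For intuition on microscaling low-bit attention, the sketch below simulates FP4-style quantization in plain PyTorch: each block of values shares one floating-point scale, and every value is snapped to the nearest FP4 (E2M1) representable number. The block size and the use of E2M1 here are illustrative assumptions; real kernels run natively on Blackwell FP4 tensor cores rather than emulating the numerics.

```python
import torch

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_microscaled_fp4(x, block=16):
    """Simulated microscaling FP4 quantization: one scale per block of values,
    each value rounded to the nearest FP4-representable number. Numerics only;
    block size 16 is an assumption for this illustration."""
    orig_shape = x.shape
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=-1, keepdim=True) / FP4_GRID.max()   # per-block scale
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = x / scale
    # Snap |value| to the nearest grid entry, then restore sign and scale.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    deq = FP4_GRID[idx] * scaled.sign() * scale
    return deq.reshape(orig_shape)

x = torch.randn(2, 64)
print((x - fake_microscaled_fp4(x)).abs().mean())  # mean quantization error
```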
arXiv Detail & Related papers (2025-05-16T18:01:54Z)
- DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. Attention in these DiTs can consume up to 95% of processing time and demands specialized context parallelism. This paper introduces DSV to accelerate video DiT training by leveraging the dynamic attention sparsity we empirically observe.
arXiv Detail & Related papers (2025-02-11T14:39:59Z)
- Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile [28.913893318345384]
Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. This paper addresses the inefficiency issue from two aspects: 1) pruning the 3D full attention based on the redundancy within video data, and 2) shortening the sampling process by adopting existing multi-step consistency distillation.
arXiv Detail & Related papers (2025-02-10T05:00:56Z)
- Fast Video Generation with Sliding Tile Attention [19.47866950957766]
When generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA operates tile-by-tile with a novel hardware-aware sliding window design.
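A minimal sketch of a tile-level sliding window, assuming a 1D token layout (STA itself tiles the 3D video latent): each query tile attends to its nearest key tiles, and whole tiles are kept so the sparsity pattern stays aligned with block-wise GPU compute.

```python
import torch

def sliding_tile_mask(n_tiles, window_tiles, tile_size):
    """Toy tile-level sliding-window mask: every token in a query tile attends
    to all tokens of the `window_tiles` nearest key tiles."""
    n = n_tiles * tile_size
    mask = torch.zeros(n, n, dtype=torch.bool)
    half = window_tiles // 2
    for qt in range(n_tiles):
        lo = max(0, qt - half)
        hi = min(n_tiles, qt + half + 1)
        mask[qt * tile_size:(qt + 1) * tile_size,
             lo * tile_size: hi * tile_size] = True
    return mask

m = sliding_tile_mask(n_tiles=16, window_tiles=5, tile_size=64)
print(m.shape, m.float().mean().item())  # retained fraction of the attention map
```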
arXiv Detail & Related papers (2025-02-06T21:17:09Z)
- Diffusion Instruction Tuning [8.985668637331335]
Lavender is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs). Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day.
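One plausible way to express such attention alignment is an auxiliary loss that pulls the VLM's text-to-vision attention maps toward reference maps extracted from a diffusion model, added on top of the usual SFT objective. The function below is a hypothetical sketch; the tensor shapes, normalization, and MSE choice are assumptions, not Lavender's actual formulation.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(vlm_attn, diffusion_attn, weight=1.0):
    """Hypothetical auxiliary loss: match the VLM's text-to-vision attention
    maps to reference maps from a text-to-image diffusion model.

    vlm_attn, diffusion_attn: [batch, text_tokens, vision_tokens].
    """
    # Normalize both maps over vision tokens, then penalize their discrepancy.
    vlm_attn = vlm_attn / vlm_attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    diffusion_attn = diffusion_attn / diffusion_attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return weight * F.mse_loss(vlm_attn, diffusion_attn)

# During SFT the total objective would be: next-token loss + alignment loss.
vlm = torch.rand(2, 8, 256); sd = torch.rand(2, 8, 256)
print(attention_alignment_loss(vlm, sd))
```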
arXiv Detail & Related papers (2025-02-04T22:20:20Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
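The title suggests that roughly half of the visual tokens can be dropped after an early layer. The sketch below shows one way such attention-guided pruning could look: rank visual tokens by the attention they receive and keep only the top fraction. The sequence layout, ranking criterion, and keep ratio are illustrative assumptions rather than FastV's exact procedure.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_start, visual_end, keep_ratio=0.5):
    """Sketch of attention-guided visual-token pruning after an early layer.

    hidden: [seq, dim] layer output; attn: [heads, seq, seq] attention weights.
    visual_start:visual_end marks where the image tokens sit in the sequence
    (this layout is an assumption for illustration).
    """
    # Average attention received by each visual token across heads and queries.
    received = attn.mean(dim=0).mean(dim=0)[visual_start:visual_end]   # [n_visual]
    n_keep = max(1, int(keep_ratio * received.numel()))
    keep = received.topk(n_keep).indices.sort().values + visual_start  # preserve order

    # Rebuild the sequence: all text tokens plus the retained visual tokens.
    idx = torch.cat([
        torch.arange(0, visual_start),
        keep,
        torch.arange(visual_end, hidden.size(0)),
    ])
    return hidden[idx], idx

hidden = torch.randn(300, 512)
attn = torch.softmax(torch.randn(8, 300, 300), dim=-1)
pruned, kept_idx = prune_visual_tokens(hidden, attn, visual_start=20, visual_end=276)
print(pruned.shape)  # roughly half of the 256 visual tokens removed
```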
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection built on Vision Transformers (ViTs).
First, in a video clip, we maintain all tokens from its keyframe while preserving only tokens relevant to actor motions from other frames.
Second, we refine scene context by leveraging the remaining tokens to better recognize actor identities.
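A small sketch of keyframe-centric token dropout along these lines: all keyframe tokens are kept, while tokens from other frames survive only if a motion-saliency score ranks them in the top fraction. The motion proxy, keep ratio, and tensor layout are assumptions for illustration, not the paper's method.

```python
import torch

def keyframe_token_dropout(tokens, frame_ids, key_frame, motion_score, keep_ratio=0.3):
    """Keep every keyframe token; from other frames keep only the tokens with
    the highest motion-saliency scores (the score itself is an assumed proxy).

    tokens: [n, dim]; frame_ids: [n] frame index per token; motion_score: [n].
    """
    is_key = frame_ids == key_frame
    others = (~is_key).nonzero(as_tuple=True)[0]
    n_keep = max(1, int(keep_ratio * others.numel()))
    top_motion = others[motion_score[others].topk(n_keep).indices]
    keep = torch.cat([is_key.nonzero(as_tuple=True)[0], top_motion]).sort().values
    return tokens[keep], keep

# 8 frames x 196 tokens; the motion proxy is random here just to run the sketch.
tokens = torch.randn(8 * 196, 768)
frame_ids = torch.arange(8).repeat_interleave(196)
motion = torch.rand(8 * 196)
kept, idx = keyframe_token_dropout(tokens, frame_ids, key_frame=4, motion_score=motion)
print(kept.shape)
```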
arXiv Detail & Related papers (2023-04-17T17:21:21Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)