Fast Video Generation with Sliding Tile Attention
- URL: http://arxiv.org/abs/2502.04507v1
- Date: Thu, 06 Feb 2025 21:17:09 GMT
- Title: Fast Video Generation with Sliding Tile Attention
- Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang
- Abstract summary: When generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA operates tile-by-tile with a novel hardware-aware sliding window design.
- Score: 19.47866950957766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
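The tile-level sparsity pattern behind STA can be illustrated with a short sketch. The snippet below is a minimal illustration, not the paper's kernel; the tile and window sizes are placeholder values. It partitions a (T, H, W) video latent into non-overlapping tiles and builds a tile-by-tile mask in which a query tile attends only to key tiles within a local 3D window:

```python
import itertools
import numpy as np

def sta_tile_mask(latent_shape=(12, 24, 40),  # tokens per (T, H, W) axis, illustrative
                  tile=(4, 4, 4),             # tile size per axis, illustrative
                  window=(3, 3, 3)):          # window size measured in tiles, illustrative
    """Tile-level block mask: entry [i, j] is True when query tile i may attend
    to key tile j, i.e. their tile coordinates differ by at most window//2 on
    every axis. The real STA kernel never materializes this dense mask."""
    grid = tuple(s // t for s, t in zip(latent_shape, tile))      # tiles per axis
    coords = list(itertools.product(*(range(g) for g in grid)))   # 3D tile coordinates
    radius = tuple(w // 2 for w in window)
    n = len(coords)
    mask = np.zeros((n, n), dtype=bool)
    for i, q in enumerate(coords):
        for j, k in enumerate(coords):
            mask[i, j] = all(abs(a - b) <= r for a, b, r in zip(q, k, radius))
    return mask

m = sta_tile_mask()
print(m.shape, f"tile-pair density: {m.mean():.3f}")  # fraction of tile pairs computed
```

Because the mask is defined over tiles rather than individual tokens, every computed block is dense, which is what makes the pattern friendly to FlashAttention-style GPU kernels.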
Related papers
- Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation [74.34633861289662]
Radial Attention is a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density. It maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention.
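As a rough illustration of how exponentially decaying compute density can yield $O(n \log n)$ work (a toy approximation, not the authors' actual mask; the frame-level layout and `base_window` are assumptions), consider a pattern that keeps all keys for nearby frames and halves the kept fraction each time the temporal distance doubles:

```python
import numpy as np

def radial_toy_mask(num_frames=64, base_window=4):
    """Toy frame-level mask: frames closer than base_window attend densely;
    beyond that, the kept fraction of key frames halves every time the
    distance doubles, giving roughly O(log n) kept keys per query frame."""
    keep = np.zeros((num_frames, num_frames), dtype=bool)
    for q in range(num_frames):
        for k in range(num_frames):
            d = abs(q - k)
            if d <= base_window:
                keep[q, k] = True
            else:
                level = int(np.log2(d / base_window)) + 1   # decay level
                keep[q, k] = (k % (2 ** level)) == 0        # keep a strided subset
    return keep

m = radial_toy_mask()
print(f"kept fraction: {m.mean():.3f}")   # well below dense attention
```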
arXiv Detail & Related papers (2025-06-24T17:59:59Z) - Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape [23.01286982392074]
A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. Existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overhead. We propose Re-ttention, which enables very high attention sparsity for visual generation models.
arXiv Detail & Related papers (2025-05-28T22:39:12Z) - Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z) - VSA: Faster Video Diffusion with Trainable Sparse Attention [21.593548582058403]
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference.
arXiv Detail & Related papers (2025-05-19T17:30:13Z) - DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance [43.423240627266644]
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency. We propose DraftAttention, a training-free framework for accelerating video diffusion transformers with dynamic sparse attention on GPUs.
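The low-resolution guidance idea can be sketched roughly as follows (a simplified toy, not the released implementation; the block size and keep ratio are illustrative): pool queries and keys into coarse block tokens, compute a cheap draft attention map, and keep only the top-scoring key blocks for each query block in the full-resolution sparse attention.

```python
import torch

def draft_block_mask(q, k, block=64, keep_ratio=0.2):
    """Toy guidance step. q, k: (seq_len, dim) with seq_len divisible by block.
    Returns a (num_blocks, num_blocks) boolean mask marking which key blocks
    each query block should attend to at full resolution."""
    d = q.shape[-1]
    qb = q.reshape(-1, block, d).mean(dim=1)              # pooled query blocks
    kb = k.reshape(-1, block, d).mean(dim=1)              # pooled key blocks
    draft = torch.softmax(qb @ kb.T / d ** 0.5, dim=-1)   # low-resolution attention map
    top = draft.topk(max(1, int(keep_ratio * kb.shape[0])), dim=-1).indices
    mask = torch.zeros(qb.shape[0], kb.shape[0], dtype=torch.bool)
    mask[torch.arange(qb.shape[0]).unsqueeze(1), top] = True
    return mask   # pass to a block-sparse attention kernel

q, k = torch.randn(4096, 128), torch.randn(4096, 128)
print(draft_block_mask(q, k).float().mean())   # roughly keep_ratio of blocks kept
```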
arXiv Detail & Related papers (2025-05-17T04:34:34Z) - Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile [28.913893318345384]
Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps.
This paper addresses the inefficiency from two aspects: 1) pruning the 3D full attention based on the redundancy within video data, and 2) shortening the sampling process by adopting existing multi-step consistency distillation.
arXiv Detail & Related papers (2025-02-10T05:00:56Z) - Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity [59.80405282381126]
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability. We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D full attention to boost inference efficiency. SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
arXiv Detail & Related papers (2025-02-03T19:29:16Z) - RAIN: Real-time Animation of Infinite Video Stream [52.97171098038888]
RAIN is a pipeline solution capable of animating infinite video streams in real-time with low latency.
RAIN generates video frames with much shorter latency and faster speed, while maintaining long-range attention over extended video streams.
RAIN can animate characters in real-time with much better quality, accuracy, and consistency than competitors.
arXiv Detail & Related papers (2024-12-27T07:13:15Z) - V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians [53.614560799043545]
V3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians.
Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs.
As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
arXiv Detail & Related papers (2024-09-20T16:54:27Z) - MaskVD: Region Masking for Efficient Video Object Detection [11.759503235646696]
Video tasks are compute-heavy and pose a challenge when deployed in real-time applications.
This paper presents a strategy for masking regions in video frames.
By leveraging features extracted from previous frames, ViT backbones directly benefit from region masking.
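A toy version of this region-masking idea (not the paper's exact criterion; the change threshold and the stand-in encoder are assumptions) is to re-encode only the patches that changed noticeably since the previous frame and reuse cached features for the rest:

```python
import torch

def masked_patch_update(curr_tokens, prev_tokens, prev_feats, encoder, thresh=0.1):
    """Toy region masking: recompute features only for patches whose embeddings
    changed since the previous frame; reuse cached features elsewhere.
    curr_tokens/prev_tokens: (num_patches, dim); prev_feats: (num_patches, out_dim)."""
    change = (curr_tokens - prev_tokens).abs().mean(dim=-1)   # per-patch change score
    active = change > thresh                                  # patches to recompute
    feats = prev_feats.clone()
    if active.any():
        feats[active] = encoder(curr_tokens[active])
    return feats, active

with torch.no_grad():
    encoder = torch.nn.Linear(64, 128)        # stand-in for a ViT backbone stage
    prev = torch.randn(196, 64)
    curr = prev.clone()
    curr[:20] += torch.randn(20, 64)          # only the first 20 patches change
    feats, active = masked_patch_update(curr, prev, encoder(prev), encoder)
    print(f"recomputed {int(active.sum())} / {active.numel()} patches")
```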
arXiv Detail & Related papers (2024-07-16T08:01:49Z) - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [14.426543629408984]
Attention is the bottleneck for large language models and long-context applications.
We develop three main techniques to speed up attention on GPUs.
We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPU by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s (75% utilization) and with FP8 reaching close to 1.2 PFLOPs/s.
arXiv Detail & Related papers (2024-07-11T15:44:48Z) - DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26-1.88x speedups over the recent Ring Attention and DeepSpeed-Ulysses, respectively.
arXiv Detail & Related papers (2023-10-05T03:47:57Z) - CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding [70.7882058229772]
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG).
Compared with short videos, long videos are in high demand but far less explored.
We propose CONE, an efficient COarse-to-fiNE alignment framework.
arXiv Detail & Related papers (2022-09-22T10:58:42Z) - DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition [140.66371549815034]
We propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition.
We show that DualFormer sets a new state of the art of 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with around 1000G inference FLOPs, at least 3.2 times fewer than existing methods with similar performance.
arXiv Detail & Related papers (2021-12-09T03:05:19Z) - TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device [58.776352999540435]
We propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance.
TSM is inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters.
For online video recognition, it achieves high frame rates of 74 fps on Jetson Nano and 29 fps on Galaxy Note8.
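The shift operation itself is simple enough to sketch (a minimal version; the 1/8-per-direction channel fraction is a commonly cited default, used here only for illustration):

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift a slice of channels along the time axis: 1/shift_div of the channels
    move one step forward in time, another 1/shift_div move one step backward,
    and the rest stay in place; no extra parameters, negligible compute.
    x: (batch, time, channels, height, width)."""
    fold = x.shape[2] // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # channels carried forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # channels carried backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out

x = torch.randn(2, 8, 64, 56, 56)   # (batch, frames, channels, H, W)
print(temporal_shift(x).shape)      # torch.Size([2, 8, 64, 56, 56])
```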
arXiv Detail & Related papers (2021-09-27T17:59:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.