Fast Video Generation with Sliding Tile Attention
- URL: http://arxiv.org/abs/2502.04507v1
- Date: Thu, 06 Feb 2025 21:17:09 GMT
- Title: Fast Video Generation with Sliding Tile Attention
- Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang
- Abstract summary: When generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA operates tile-by-tile with a novel hardware-aware sliding window design.
- Score: 19.47866950957766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
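The contrast the abstract draws between token-wise sliding window attention and tile-by-tile attention can be illustrated with a small mask experiment. The sketch below is a 1D simplification in plain Python, not the authors' kernel: the function names, the tile size, and the window sizes are assumptions chosen for illustration, and the actual method operates on 3D video latents. It counts how many tile-sized blocks of the attention mask are fully dense, fully empty, or mixed.

```python
def token_sliding_mask(n, window):
    """Token-wise sliding window: query i sees key j when |i - j| <= window // 2."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def tile_sliding_mask(n, tile, window_tiles):
    """Tile-wise sliding window (illustrative): every query in tile a sees
    every key in tile b when |a - b| <= window_tiles // 2."""
    half = window_tiles // 2
    return [[abs(i // tile - j // tile) <= half for j in range(n)]
            for i in range(n)]


def count_block_kinds(mask, tile):
    """Split the mask into tile x tile blocks and return (dense, empty, mixed)."""
    t = len(mask) // tile
    dense = empty = mixed = 0
    for a in range(t):
        for b in range(t):
            cells = [mask[i][j]
                     for i in range(a * tile, (a + 1) * tile)
                     for j in range(b * tile, (b + 1) * tile)]
            if all(cells):
                dense += 1
            elif not any(cells):
                empty += 1
            else:
                mixed += 1
    return dense, empty, mixed


if __name__ == "__main__":
    # 16 tokens, tiles of 4 tokens (assumed sizes, for illustration only).
    print(count_block_kinds(token_sliding_mask(16, 8), 4))    # (4, 6, 6)
    print(count_block_kinds(tile_sliding_mask(16, 4, 3), 4))  # (10, 6, 0)
```

Under these assumptions the token-wise mask leaves 6 of 16 blocks partially filled, while the tile-wise mask has none. A block-wise attention kernel can compute a fully dense block without masking and skip a fully empty one, which is consistent with the hardware-efficiency argument in the abstract.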
Related papers
- MonarchRT: Efficient Attention for Real-Time Video Generation [36.624688008552546]
We propose Monarch-RT, a structured sparse attention parameterization for video diffusion models.
We achieve high expressivity while preserving computational efficiency.
Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing.
arXiv Detail & Related papers (2026-02-12T18:56:53Z)
- MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU.
Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly.
Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
arXiv Detail & Related papers (2025-11-03T06:37:53Z)
- StreamingVLM: Real-Time Understanding for Infinite Video Streams [23.94087606884915]
StreamingVLM is a model designed for real-time, stable understanding of infinite visual input.
Our approach is a unified framework that aligns training with streaming inference.
On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.
arXiv Detail & Related papers (2025-10-10T17:59:58Z)
- SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [88.47701139980636]
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck.
We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank.
We propose SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
arXiv Detail & Related papers (2025-09-28T17:58:59Z)
- Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation [74.34633861289662]
Radial Attention is a scalable sparse attention mechanism with $O(n\log n)$ complexity that translates energy decay into exponentially decaying compute density.
It maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention.
arXiv Detail & Related papers (2025-06-24T17:59:59Z)
- Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape [23.01286982392074]
A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length.
Existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads.
We propose Re-ttention, which implements very high sparse attention for visual generation models.
arXiv Detail & Related papers (2025-05-28T22:39:12Z)
- VORTA: Efficient Video Diffusion via Routing Sparse Attention [54.84294780326206]
VORTA is an acceleration framework with two novel components.
It achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench.
It can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching a speedup of up to $14.41\times$ with negligible performance degradation.
arXiv Detail & Related papers (2025-05-24T17:46:47Z)
- Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation.
As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z)
- VSA: Faster Video Diffusion with Trainable Sparse Attention [21.593548582058403]
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions.
We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference.
arXiv Detail & Related papers (2025-05-19T17:30:13Z)
- DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance [43.423240627266644]
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality.
However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency.
We propose DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs.
arXiv Detail & Related papers (2025-05-17T04:34:34Z)
- Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile [28.913893318345384]
Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps.
This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data, and 2) Shorten the sampling process by adopting existing multi-step consistency distillation.
arXiv Detail & Related papers (2025-02-10T05:00:56Z)
- Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity [59.80405282381126]
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability.
We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency.
SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
arXiv Detail & Related papers (2025-02-03T19:29:16Z)
- RAIN: Real-time Animation of Infinite Video Stream [52.97171098038888]
RAIN is a pipeline solution capable of animating infinite video streams in real-time with low latency.
RAIN generates video frames with much shorter latency and faster speed, while maintaining long-range attention over extended video streams.
RAIN can animate characters in real-time with much better quality, accuracy, and consistency than competitors.
arXiv Detail & Related papers (2024-12-27T07:13:15Z)
- V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians [53.614560799043545]
V3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians.
Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs.
As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
arXiv Detail & Related papers (2024-09-20T16:54:27Z)
- MaskVD: Region Masking for Efficient Video Object Detection [11.759503235646696]
Video tasks are compute-heavy and pose a challenge when deploying in real-time applications.
This paper presents a strategy for masking regions in video frames.
By leveraging extracted features from previous frames, ViT backbones directly benefit from region masking.
arXiv Detail & Related papers (2024-07-16T08:01:49Z)
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [14.426543629408984]
Attention is the bottleneck for large language models and long-context applications.
We develop three main techniques to speed up attention on GPUs.
We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPU by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s (75% utilization) and with FP8 reaching close to 1.2 PFLOPs/s.
arXiv Detail & Related papers (2024-07-11T15:44:48Z)
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26 - 1.88x speedup compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
- CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding [70.7882058229772]
This paper tackles an emerging and challenging problem of long video temporal grounding (VTG).
Compared with short videos, long videos are also in high demand but remain less explored.
We propose CONE, an efficient COarse-to-fiNE alignment framework.
arXiv Detail & Related papers (2022-09-22T10:58:42Z)
- DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition [140.66371549815034]
We propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition.
We show that DualFormer sets new state-of-the-art 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with around 1000G inference FLOPs which is at least 3.2 times fewer than existing methods with similar performances.
arXiv Detail & Related papers (2021-12-09T03:05:19Z)
- TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device [58.776352999540435]
We propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance.
TSM is inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters.
It achieves a high frame rate of 74 fps and 29 fps for online video recognition on Jetson Nano and Galaxy Note8.
arXiv Detail & Related papers (2021-09-27T17:59:39Z)