VORTA: Efficient Video Diffusion via Routing Sparse Attention
- URL: http://arxiv.org/abs/2505.18809v1
- Date: Sat, 24 May 2025 17:46:47 GMT
- Title: VORTA: Efficient Video Diffusion via Routing Sparse Attention
- Authors: Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao
- Abstract summary: Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive. We propose VORTA, an acceleration framework with two novel components. It achieves a $1.76\times$ end-to-end speedup without quality loss on VBench.
- Score: 45.269274789183974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose \textbf{VORTA}, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a $1.76\times$ end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.
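The routing idea described in the abstract, swapping full 3D attention for cheaper sparse variants on a per-block, per-step basis, can be illustrated with a short sketch. The PyTorch module below is a minimal illustration under assumed details: the router design, the sliding-window sparse variant, and names such as `RoutedAttention` and `route_threshold` are hypothetical and are not taken from the paper.

```python
# Minimal sketch of routing between full 3D attention and a sparse variant,
# in the spirit of the abstract. All names (RoutedAttention, route_threshold,
# the sliding-window variant) are illustrative assumptions, not the authors'
# implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutedAttention(nn.Module):
    def __init__(self, dim, num_heads, window=256, route_threshold=0.5):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.num_heads = num_heads
        self.window = window
        self.route_threshold = route_threshold
        # Lightweight router: scores how much this block/step needs long-range attention.
        self.router = nn.Linear(dim, 1)

    def forward(self, x):
        # x: (batch, tokens, dim) -- flattened (frames * height * width) video tokens.
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))

        # One routing score per sequence; a real router could also condition on
        # the denoising timestep and the layer index.
        gate = torch.sigmoid(self.router(x.mean(dim=1)))  # (batch, 1)

        if gate.mean() > self.route_threshold:
            out = F.scaled_dot_product_attention(q, k, v)  # full 3D attention
        else:
            # Sparse variant: each query attends only to a local window of keys.
            idx = torch.arange(n, device=x.device)
            mask = (idx[None, :] - idx[:, None]).abs() <= self.window  # (n, n) bool
            out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

        return self.proj(out.transpose(1, 2).reshape(b, n, d))
```

A hard threshold on a single score, as above, is only for illustration; a practical router would presumably make finer-grained choices across heads, layers, and denoising steps, which is what allows speedups without quality loss.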
Related papers
- VMoBA: Mixture-of-Block Attention for Video Diffusion Models [29.183614108287276]
This paper introduces Video Mixture of Block Attention (VMoBA), a novel attention mechanism specifically adapted for Video Diffusion Models (VDMs). Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, VMoBA enhances the original MoBA framework with three key modifications. Experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving a 2.92x FLOPs speedup and a 1.48x latency speedup. A generic block-sparse attention sketch in this spirit appears after this list.
arXiv Detail & Related papers (2025-06-30T13:52:31Z)
- FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z)
- FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation [14.903360987684483]
We propose FEAT, a full-dimensional efficient attention Transformer for high-quality dynamic medical video generation. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance.
arXiv Detail & Related papers (2025-06-05T12:31:02Z)
- Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers [24.105473321347894]
We propose Sparse-vDiT, a sparsity acceleration framework for Video Diffusion Transformer (vDiT). We show that Sparse-vDiT achieves 2.09$\times$, 2.38$\times$, and 1.67$\times$ theoretical FLOP reduction, and actual inference speedups of 1.76$\times$, 1.85$\times$, and 1.58$\times$, respectively. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
arXiv Detail & Related papers (2025-06-03T16:42:37Z)
- SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation [26.045123066151838]
SRDiffusion is a novel framework that leverages collaboration between large and small models to reduce inference cost. Our method is introduced as a new direction complementary to existing acceleration strategies, offering a practical solution for scalable video generation.
arXiv Detail & Related papers (2025-05-25T13:58:52Z)
- Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. Their attention can consume up to 95% of processing time and demands specialized context parallelism. This paper introduces DSV to accelerate video DiT training by leveraging the dynamic attention sparsity we empirically observe.
arXiv Detail & Related papers (2025-02-11T14:39:59Z)
- VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment [54.66217340264935]
VideoLifter is a novel video-to-3D pipeline that leverages a local-to-global strategy on a fragment basis. It significantly accelerates the reconstruction process, reducing training time by over 82% while delivering better visual quality than current SOTA methods.
arXiv Detail & Related papers (2025-01-03T18:52:36Z)
- AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration [45.62669899834342]
Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs.
arXiv Detail & Related papers (2024-12-16T12:28:22Z)
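Several of the papers above (e.g. VMoBA, Sparse-vDiT, DSV) accelerate video diffusion transformers by exploiting block-level sparsity in attention. As a common reference point, here is a generic top-k block-sparse attention sketch; the block size, the mean-pooled block scoring, and the name `block_sparse_attention` are assumptions for illustration and do not reproduce any single paper's method.

```python
# Generic top-k block-sparse attention sketch in PyTorch. Each query block
# attends only to the k key blocks with the highest block-level affinity.
# All details here are illustrative assumptions.
import torch
import torch.nn.functional as F


def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """q, k, v: (batch, heads, tokens, head_dim); tokens must be divisible by block_size."""
    b, h, n, d = q.shape
    nb = n // block_size

    # Summarise each block by its mean token, then score query blocks against key blocks.
    q_blk = q.reshape(b, h, nb, block_size, d).mean(dim=3)     # (b, h, nb, d)
    k_blk = k.reshape(b, h, nb, block_size, d).mean(dim=3)     # (b, h, nb, d)
    scores = torch.einsum("bhqd,bhkd->bhqk", q_blk, k_blk)     # block-level affinity

    # Keep only the top-k key blocks for every query block.
    top = scores.topk(min(top_k, nb), dim=-1).indices          # (b, h, nb, top_k)
    keep = torch.zeros(b, h, nb, nb, dtype=torch.bool, device=q.device)
    keep.scatter_(-1, top, True)

    # Expand the block-level mask to a token-level attention mask.
    mask = keep.repeat_interleave(block_size, dim=2).repeat_interleave(block_size, dim=3)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


if __name__ == "__main__":
    q = k = v = torch.randn(1, 2, 256, 32)
    out = block_sparse_attention(q, k, v, block_size=64, top_k=2)
    print(out.shape)  # torch.Size([1, 2, 256, 32])
```

This sketch materialises a dense boolean mask, so it only saves FLOPs conceptually; the papers listed above pair block selection with kernels that skip the masked blocks outright.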