VMonarch: Efficient Video Diffusion Transformers with Structured Attention
- URL: http://arxiv.org/abs/2601.22275v1
- Date: Thu, 29 Jan 2026 19:48:13 GMT
- Title: VMonarch: Efficient Video Diffusion Transformers with Structured Attention
- Authors: Cheng Liang, Haoxian Chen, Liang Hou, Qi Fan, Gangshan Wu, Xin Tao, Limin Wang
- Abstract summary: We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. We propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns.
- Score: 49.26162294859424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5x in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.
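The Monarch structure behind these savings is a product of two block-diagonal factors interleaved with a fixed stride permutation, which reduces the matrix-vector cost from O(n^2) to O(n^1.5). Below is a minimal PyTorch sketch of that product, assuming the sequence length factors as n = m * m with square m x m blocks; the function and variable names are illustrative rather than taken from the paper's code.

```python
# Minimal Monarch matrix-vector product: M = P^T L P R with L, R
# block-diagonal and P the stride permutation. Illustrative sketch only.
import torch

def monarch_matvec(blocks_l, blocks_r, x):
    """Compute M @ x in O(n^1.5) instead of O(n^2).

    blocks_l, blocks_r: (m, m, m) tensors holding m blocks of size m x m.
    x: (n,) with n = m * m.
    """
    m = blocks_l.shape[0]
    x = x.view(m, m)                             # split x into m chunks
    x = torch.einsum('bij,bj->bi', blocks_r, x)  # block-diagonal factor R
    x = x.t().contiguous()                       # stride permutation P
    x = torch.einsum('bij,bj->bi', blocks_l, x)  # block-diagonal factor L
    return x.t().contiguous().view(-1)           # inverse permutation, flatten

m = 32                                           # n = 1024 tokens
y = monarch_matvec(torch.randn(m, m, m), torch.randn(m, m, m),
                   torch.randn(m * m))
```

The abstract's alternating minimization then fits such factors to the dynamic attention pattern at inference time; a toy version of that fitting step is sketched after the MonarchAttention entry below.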
Related papers
- MonarchRT: Efficient Attention for Real-Time Video Generation [36.624688008552546]
We propose Monarch-RT, a structured sparse attention parameterization for video diffusion models.
We achieve high expressivity while preserving computational efficiency.
Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing.
arXiv Detail & Related papers (2026-02-12T18:56:53Z)
- PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs [57.790910044227935]
Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames.
We present Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs.
Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts, and multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling.
arXiv Detail & Related papers (2025-11-14T05:56:47Z)
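The kernel-smoothing claim in this abstract can be checked numerically: averaging a RoPE-style time kernel over two opposed phase offsets damps each frequency component by cos(w * delta), attenuating high-frequency ripple while leaving slow components nearly intact. The sketch below illustrates this identity; the kernel form (a mean of cosines over standard RoPE frequencies) and the offset value are assumptions for illustration, not the PAS method itself.

```python
# Numerical check of the smoothing identity, not an implementation of PAS.
import numpy as np

def time_kernel(dt, freqs):
    # Models the abstract's time kernel: RoPE-rotated logits behave like a
    # content term scaled by a mean of cosines over the rotary frequencies.
    return np.cos(np.outer(dt, freqs)).mean(axis=-1)

freqs = 10000.0 ** (-np.arange(0, 64, 2) / 64)   # standard RoPE frequencies
dt = np.linspace(-8.0, 8.0, 1001)                # relative frame-time shifts
delta = 0.5                                      # assumed phase offset
raw = time_kernel(dt, freqs)
smoothed = 0.5 * (time_kernel(dt + delta, freqs) + time_kernel(dt - delta, freqs))
# cos(w*(t+d)) + cos(w*(t-d)) = 2*cos(w*t)*cos(w*d): each component of
# 'smoothed' is the raw kernel damped by cos(w * delta).
```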
- Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation [21.87891961960399]
Compact Attention is a hardware-aware acceleration framework featuring three innovations.
Our method achieves 1.6-2.5x acceleration in attention on single-GPU setups.
This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation.
arXiv Detail & Related papers (2025-08-18T14:45:42Z)
- Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers [24.105473321347894]
We propose Sparse-vDiT, a sparsity acceleration framework for Video Diffusion Transformers (vDiT).
We show that Sparse-vDiT achieves 2.09$\times$, 2.38$\times$, and 1.67$\times$ theoretical FLOP reduction, and actual inference speedups of 1.76$\times$, 1.85$\times$, and 1.58$\times$, respectively.
Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
arXiv Detail & Related papers (2025-06-03T16:42:37Z)
- Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape [38.76559841681518]
A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length.
Existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overheads.
We propose Re-ttention, which implements extremely sparse attention for visual generation models.
arXiv Detail & Related papers (2025-05-28T22:39:12Z)
- VORTA: Efficient Video Diffusion via Routing Sparse Attention [54.84294780326206]
VORTA is an acceleration framework with two novel components.
It achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench.
It can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching a speedup of up to $14.41\times$ with negligible performance degradation.
arXiv Detail & Related papers (2025-05-24T17:46:47Z)
- MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention [10.244490009712466]
We propose a novel approach to sub-quadratic attention approximation via Monarch matrices.
MonarchAttention is both transferable, yielding minimal performance loss with no additional training, and hardware-efficient.
We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language.
arXiv Detail & Related papers (2025-05-24T13:44:44Z)
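Both this paper and VMonarch above hinge on projecting an attention matrix onto Monarch form. Writing the Monarch entries as M[(i1,i2),(j1,j2)] = L[i2,i1,j1] * R[j1,i2,j2], every (i2,j1) slice is a rank-1 factor pair with closed-form least-squares updates, so alternating minimization is cheap. The toy sketch below fits such factors to a dense target; the papers' exact objectives and schedules may differ, and all names are illustrative.

```python
# Toy alternating least squares projection onto Monarch form; illustrative.
import torch

def fit_monarch(A, m, iters=20, eps=1e-8):
    """Approximate A (n x n, n = m * m) with entries L[i2,i1,j1] * R[j1,i2,j2]."""
    A4 = A.view(m, m, m, m)            # axes: (i1, i2, j1, j2)
    L = torch.randn(m, m, m)           # axes: (i2, i1, j1)
    R = torch.randn(m, m, m)           # axes: (j1, i2, j2)
    for _ in range(iters):
        # Closed-form scalar least squares for each L entry, R held fixed.
        num = torch.einsum('abcd,cbd->bac', A4, R)
        L = num / ((R ** 2).sum(-1).t()[:, None, :] + eps)
        # Symmetric update for R, L held fixed.
        num = torch.einsum('abcd,bac->cbd', A4, L)
        R = num / ((L ** 2).sum(1).t()[:, :, None] + eps)
    return L, R

m = 8
A = torch.randn(m * m, m * m)
L, R = fit_monarch(A, m)
M = torch.einsum('bac,cbd->abcd', L, R).reshape(m * m, m * m)
print((A - M).norm() / A.norm())       # relative approximation error
```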
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence-length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
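For contrast with the structured-matrix approaches above, the core mechanism here, compressing the sequence length with an input-dependent projection before attention, fits in a few lines. This is a hedged sketch of that general idea under assumed shapes, not DBA's actual bilinear parameterization; all names are illustrative.

```python
# Generic input-dependent sequence compression; not DBA's exact formulation.
import torch
import torch.nn.functional as F

def dynamic_lowrank_attention(q, k, v, proj):
    """q, k, v: (n, d); proj maps (n, d) -> (n, r). Cost O(n*r*d), not O(n^2*d)."""
    p = F.softmax(proj(k), dim=0)        # (n, r) input-sensitive mixing weights
    k_c = p.t() @ k                      # (r, d) compressed keys
    v_c = p.t() @ v                      # (r, d) compressed values
    attn = F.softmax(q @ k_c.t() / q.shape[-1] ** 0.5, dim=-1)  # (n, r)
    return attn @ v_c                    # (n, d)

n, d, r = 1024, 64, 32
q = k = v = torch.randn(n, d)
out = dynamic_lowrank_attention(q, k, v, torch.nn.Linear(d, r))
```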
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Revisiting Dynamic Convolution via Matrix Decomposition [81.89967403872147]
We propose dynamic channel fusion to replace dynamic attention over channel groups.
Our method is easier to train and requires significantly fewer parameters without sacrificing accuracy.
arXiv Detail & Related papers (2021-03-15T23:03:18Z)