MonarchRT: Efficient Attention for Real-Time Video Generation
- URL: http://arxiv.org/abs/2602.12271v1
- Date: Thu, 12 Feb 2026 18:56:53 GMT
- Title: MonarchRT: Efficient Attention for Real-Time Video Generation
- Authors: Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, Beidi Chen
- Abstract summary: We propose Monarch-RT, a structured attention parameterization for video diffusion models. We achieve high expressivity while preserving computational efficiency. Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing.
- Score: 36.624688008552546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of the parameterization through finetuning and custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8x. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
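For intuition about the core primitive, here is a minimal PyTorch sketch of a plain Monarch matrix-vector product (block-diagonal factor, stride permutation, block-diagonal factor, inverse permutation), in the style of the original Monarch matrices of Dao et al.; it illustrates the general structure and its sub-quadratic cost, not Monarch-RT's tiled, attention-aligned parameterization or its Triton kernels. All function names and shapes here are illustrative assumptions.

```python
import torch

def monarch_matvec(L, R, x):
    """Multiply x by a Monarch matrix M = P^T L P R, where L and R are
    block-diagonal with b blocks of shape (b, b) (so n = b * b) and P is
    the stride (perfect-shuffle) permutation. Cost: O(n^{3/2}) versus
    O(n^2) for a dense matrix. NOTE: a generic sketch, not Monarch-RT's
    tiled variant.
    """
    b = L.shape[0]
    n = b * b
    assert x.shape[-1] == n
    xb = x.reshape(*x.shape[:-1], b, b)           # split into b contiguous blocks
    xb = torch.einsum('kij,...kj->...ki', R, xb)  # block-diagonal factor R
    xb = xb.transpose(-1, -2)                     # permutation P (stride permute)
    xb = torch.einsum('kij,...kj->...ki', L, xb)  # block-diagonal factor L
    xb = xb.transpose(-1, -2)                     # P^T: undo the permutation
    return xb.reshape(*x.shape[:-1], n)

# Toy example: a 64x64 Monarch matrix applied to a batch of 2 vectors.
b = 8
L = torch.randn(b, b, b) / b ** 0.5
R = torch.randn(b, b, b) / b ** 0.5
x = torch.randn(2, b * b)
print(monarch_matvec(L, R, x).shape)              # torch.Size([2, 64])
```

Two block-diagonal factors linked by a permutation give every output position a structured global receptive field at sub-quadratic cost, which is presumably why the abstract stresses aligning the block structure with the video's spatiotemporal layout.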
Related papers
- VMonarch: Efficient Video Diffusion Transformers with Structured Attention [49.26162294859424]
We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. We propose VMonarch, a novel attention mechanism for Video DiTs that efficiently captures these dynamic sparse patterns.
arXiv Detail & Related papers (2026-01-29T19:48:13Z)
- Real-Time Motion-Controllable Autoregressive Video Diffusion [79.32730467857535]
We propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout learning mechanism and accelerates training by selectively reducing denoising steps.
arXiv Detail & Related papers (2025-10-09T12:17:11Z)
- Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation [50.04968365065964]
Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference. We introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP). We also propose Decoupled Foreground Attention (DFA) to further accelerate attention computations.
arXiv Detail & Related papers (2025-08-25T02:58:39Z)
- Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation [21.87891961960399]
Compact Attention is a hardware-aware acceleration framework featuring three innovations. Our method achieves 1.6-2.5x acceleration in attention computation on single-GPU setups. This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation.
arXiv Detail & Related papers (2025-08-18T14:45:42Z)
- Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers [24.105473321347894]
We propose Sparse-vDiT, a sparsity acceleration framework for the Video Diffusion Transformer (vDiT). We show that Sparse-vDiT achieves 2.09x, 2.38x, and 1.67x theoretical FLOP reduction, and actual inference speedups of 1.76x, 1.85x, and 1.58x, respectively. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
arXiv Detail & Related papers (2025-06-03T16:42:37Z)
- Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape [38.76559841681518]
A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. Existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. We propose Re-ttention, which implements very sparse attention for visual generation models.
arXiv Detail & Related papers (2025-05-28T22:39:12Z)
- RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy [10.53687668536011]
RainFusion exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. Our proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models.
arXiv Detail & Related papers (2025-05-27T11:15:02Z)
- FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge [60.000984252907195]
Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency. Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. We propose the FastCar framework to accelerate the decoding phase of AR video generation by exploiting temporal redundancy.
arXiv Detail & Related papers (2025-05-17T05:00:39Z)
- Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity [59.80405282381126]
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability. We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. A minimal sketch of this kind of spatial-temporal sparsity pattern appears after this list.
arXiv Detail & Related papers (2025-02-03T19:29:16Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with far fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
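As referenced in the Sparse VideoGen entry above, the following is a minimal, hypothetical PyTorch sketch of the generic spatial-temporal sparsity idea that several of the listed papers share: a mask that keeps intra-frame (spatial) pairs and same-position cross-frame (temporal) pairs, applied to a dense reference attention. It is an illustrative assumption, not the actual method of any paper above; real systems realize such masks with block-sparse kernels rather than dense masking.

```python
import torch

def spatio_temporal_mask(frames, tokens_per_frame):
    """Hypothetical sparsity pattern: each token attends within its own
    frame (spatial) or to the same spatial position in every frame
    (temporal). True = keep the attention entry."""
    n = frames * tokens_per_frame
    fid = torch.arange(n) // tokens_per_frame   # frame index of each token
    pos = torch.arange(n) % tokens_per_frame    # spatial index of each token
    spatial = fid[:, None] == fid[None, :]      # same-frame pairs
    temporal = pos[:, None] == pos[None, :]     # same-position pairs
    return spatial | temporal

def masked_attention(q, k, v, mask):
    """Dense reference implementation of masked attention; efficient
    kernels would skip the masked-out blocks entirely."""
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

# Toy example: 4 frames x 16 tokens per frame, one head of width 32.
mask = spatio_temporal_mask(frames=4, tokens_per_frame=16)
q = k = v = torch.randn(64, 32)
out = masked_attention(q, k, v, mask)
print(mask.float().mean())   # fraction of entries kept: 19/64 ~ 0.30 here
```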