Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention
- URL: http://arxiv.org/abs/2602.04789v1
- Date: Wed, 04 Feb 2026 17:41:53 GMT
- Title: Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention
- Authors: Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang,
- Abstract summary: Light Forcing is the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk. We also introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner.
- Score: 28.598033369607723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textsc{Light Forcing}, the \textit{first} sparse attention solution tailored for AR video generation models. It incorporates a \textit{Chunk-Aware Growth} mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity-increase strategy enables the current chunk to inherit prior knowledge from earlier chunks during generation. Additionally, we introduce a \textit{Hierarchical Sparse Attention} to capture informative historical and local context in a coarse-to-fine manner. Such a two-level mask selection strategy (\ie, frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention methods in quality (\eg, 84.5 on VBench) and efficiency (\eg, $1.2{\sim}1.3\times$ end-to-end speedup). Combined with FP8 quantization and LightVAE, \textsc{Light Forcing} further achieves a $2.3\times$ speedup and 19.7\,FPS on an RTX~5090 GPU. Code will be released at \href{https://github.com/chengtao-lv/LightForcing}{https://github.com/chengtao-lv/LightForcing}.
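The coarse-to-fine, two-level (frame then block) mask selection described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general idea only, not the paper's released implementation; the function name, pooling choice, and top-k parameters are assumptions made for the example.

```python
import numpy as np

def hierarchical_mask(q, k, frame_len, block_len, top_frames, top_blocks):
    """Sketch of two-level (frame -> block) sparse-mask selection.

    Illustrative only: coarse level keeps the highest-scoring frames,
    fine level keeps the highest-scoring blocks inside those frames.
    q: (d,) query vector; k: (T, d) key matrix over T past tokens.
    Returns a boolean mask over the T key tokens.
    """
    T, d = k.shape
    scores = k @ q / np.sqrt(d)  # token-level attention logits

    # Coarse level: mean-pool token scores per frame, keep top frames.
    n_frames = T // frame_len
    frame_scores = (
        scores[: n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1)
    )
    kept_frames = np.argsort(frame_scores)[-top_frames:]

    # Fine level: within each kept frame, keep the top-scoring blocks.
    mask = np.zeros(T, dtype=bool)
    for f in kept_frames:
        start = f * frame_len
        fs = scores[start : start + frame_len]
        n_blocks = frame_len // block_len
        block_scores = (
            fs[: n_blocks * block_len].reshape(n_blocks, block_len).mean(axis=1)
        )
        for b in np.argsort(block_scores)[-top_blocks:]:
            mask[start + b * block_len : start + (b + 1) * block_len] = True
    return mask
```

Attention would then be computed only over key tokens where the mask is true, reducing the quadratic cost of dense attention to the selected budget.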
Related papers
- SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration [23.86429472943524]
We present a training-free acceleration framework that exploits three properties of Visual AutoRegressive attention: strong attention sinks, cross-scale activation similarity, and pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism. Our method achieves a $\mathbf{1.57}\times$ speed-up while preserving almost all high-frequency details.
arXiv Detail & Related papers (2026-02-04T09:34:06Z) - VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding [52.69880888587866]
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. We propose VidLaDA, a Diffusion Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive modeling and decode tokens in parallel. Experiments show VidLaDA rivals state-of-the-art AR baselines and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.
arXiv Detail & Related papers (2026-01-25T15:02:01Z) - Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation [69.57572900337176]
We introduce Reward Forcing, a novel framework for efficient streaming video generation. EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying. Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model.
arXiv Detail & Related papers (2025-12-04T11:12:13Z) - Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution [41.19210731686364]
Directly adapting generative diffusion models to video super-resolution (VSR) can result in redundancy. OASIS is an efficient one-step diffusion model with attention specialization for real-world video super-resolution. OASIS achieves state-of-the-art performance on both synthetic and real-world datasets.
arXiv Detail & Related papers (2025-09-28T17:08:51Z) - FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model [87.38823851271758]
We propose a new framework for Transformer-like recommendation models. FuXi-$\beta$ outperforms previous state-of-the-art models and achieves significant acceleration. Our code is available in a public repository: https://github.com/USTC-StarTeam/FuXi-beta.
arXiv Detail & Related papers (2025-08-14T13:12:29Z) - Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape [38.76559841681518]
A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. Existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. We propose Re-ttention, which implements very high sparse attention for visual generation models.
arXiv Detail & Related papers (2025-05-28T22:39:12Z) - Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z) - FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge [60.000984252907195]
Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency. Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. We propose the FastCar framework to accelerate the decoding phase of AR video generation by exploiting temporal redundancy.
arXiv Detail & Related papers (2025-05-17T05:00:39Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation [61.59996525424585]
DIFFVSGG is an online VSGG solution that frames this task as an iterative scene graph update problem. We unify three tasks, object classification, bounding box regression, and graph generation, by decoding them from one shared feature embedding. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage the results of past frames as conditional inputs to LDMs.
arXiv Detail & Related papers (2025-03-18T06:49:51Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.