LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
- URL: http://arxiv.org/abs/2510.08318v1
- Date: Thu, 09 Oct 2025 15:03:39 GMT
- Title: LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
- Authors: Yushi Huang, Xingtong Ge, Ruihao Gong, Chengtao Lv, Jun Zhang,
- Abstract summary: We present an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention. Our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
- Score: 19.69237169218592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
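For orientation, the sketch below contrasts standard softmax self-attention, whose cost grows quadratically with sequence length, against a generic kernelized linear attention that computes phi(Q) (phi(K)^T V) in O(n). This is a minimal illustration of the kind of linear-attention module that could stand in for a self-attention layer; the ELU+1 feature map and function names are assumptions for the example, not LinVideo's exact formulation.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard self-attention: materializes an (n x n) score matrix, O(n^2).
    # q, k, v: (batch, heads, n, d)
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention with a non-negative feature map (ELU + 1).
    # Associativity lets us compute phi(K)^T V first, so cost is O(n) in
    # sequence length instead of O(n^2).
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                 # (b, h, d, d_v)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# Usage example: both paths return (batch, heads, n, d_v) outputs.
q = torch.randn(1, 8, 4096, 64)
k = torch.randn(1, 8, 4096, 64)
v = torch.randn(1, 8, 4096, 64)
out_quadratic = softmax_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

The practical point is the memory/compute asymmetry: for video, where the token count n (frames x spatial patches) is large, avoiding the n x n score matrix is what makes the O(n) replacement attractive, while the paper's selective transfer and ADM objective address the quality gap such a swap would otherwise introduce.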
Related papers
- Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation [69.57572900337176]
We introduce Reward Forcing, a novel framework for efficient streaming video generation. EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying. Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model.
arXiv Detail & Related papers (2025-12-04T11:12:13Z) - Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer [13.545000689565732]
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch.
arXiv Detail & Related papers (2025-09-29T15:09:51Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [70.4360995984905]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z) - From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z) - Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions [27.111140222002653]
This paper investigates the role of CLIP image embeddings within the Stable Video Diffusion (SVD) framework.
We introduce the VCUT, a training-free approach optimized for efficiency within the SVD architecture.
The implementation of VCUT leads to a reduction of up to 322T Multiply-Accumulate Operations (MACs) per video and a decrease in model parameters by up to 50M, achieving a 20% reduction in latency compared to the baseline.
arXiv Detail & Related papers (2024-07-27T08:21:14Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundancy blocks of the model and maintain the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - On Compressing Sequences for Self-Supervised Speech Models [78.62210521316081]
We study fixed-length and variable-length subsampling along the time axis in self-supervised learning.
We find that variable-length subsampling performs particularly well under low frame rates.
If we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
arXiv Detail & Related papers (2022-10-13T17:10:02Z)