Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- URL: http://arxiv.org/abs/2506.08009v1
- Date: Mon, 09 Jun 2025 17:59:55 GMT
- Title: Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- Authors: Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman
- Abstract summary: We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
- Score: 70.4360995984905
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/
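The abstract above outlines the full training recipe: autoregressive rollout with KV caching, a holistic video-level loss, a few-step diffusion generator, and stochastic gradient truncation. The following is a minimal, hedged sketch of how one such training step could look in PyTorch-style code; the method names (`init_kv_cache`, `denoise`, `update_kv_cache`), the `holistic_loss` callable, and the exact form of the gradient truncation are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of one Self Forcing-style training step (assumptions noted above).
import torch

def self_forcing_step(model, holistic_loss, optimizer,
                      num_frames, num_denoise_steps, frame_shape, device="cuda"):
    kv_cache = model.init_kv_cache()               # cached keys/values of past frames
    frames = []
    for t in range(num_frames):
        x = torch.randn(frame_shape, device=device)  # each frame starts from noise
        # Stochastic gradient truncation (assumed form): run all but one
        # denoising step without gradient, then one step that keeps the graph.
        k = torch.randint(1, num_denoise_steps + 1, (1,)).item()
        with torch.no_grad():
            for s in range(k - 1):
                x = model.denoise(x, step=s, kv_cache=kv_cache)
        x = model.denoise(x, step=k - 1, kv_cache=kv_cache)
        # Later frames attend to previously *generated* frames via the cache,
        # mirroring inference-time rollout; the cache is detached here so
        # gradients do not flow across frame boundaries.
        kv_cache = model.update_kv_cache(kv_cache, x.detach())
        frames.append(x)
    video = torch.stack(frames, dim=1)             # (B, T, C, H, W)
    loss = holistic_loss(video)                    # video-level loss on the whole rollout
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The rolling KV cache mentioned for extrapolation would, in this sketch, amount to evicting the oldest frame's cached keys/values once a fixed context window is exceeded before generating the next frame.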
Related papers
- Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation [35.96336774165454]
Existing large-scale video generation models are computationally intensive. We propose autoregressive adversarial post-training to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates one latent frame at a time using a single neural function evaluation.
arXiv Detail & Related papers (2025-06-11T03:04:23Z)
- Generative Pre-trained Autoregressive Diffusion Transformer [54.476056835275415]
GPDiT is a Generative Pre-trained Autoregressive Diffusion Transformer. It unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis. It autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics.
arXiv Detail & Related papers (2025-05-12T08:32:39Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
The new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss (a hedged sketch of such a combined objective appears after this list). Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- TimeDART: A Diffusion Autoregressive Transformer for Self-Supervised Time Series Representation [6.047856576139978]
We propose TimeDART, a novel self-supervised time series pre-training framework. TimeDART unifies two powerful generative paradigms to learn more transferable representations. We conduct extensive experiments on public datasets for time series forecasting and classification.
arXiv Detail & Related papers (2024-10-08T06:08:33Z)
- On Compressing Sequences for Self-Supervised Speech Models [78.62210521316081]
We study fixed-length and variable-length subsampling along the time axis in self-supervised learning.
We find that variable-length subsampling performs particularly well under low frame rates.
If we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
arXiv Detail & Related papers (2022-10-13T17:10:02Z)
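The tokenizer entry above (Conditioned Diffusion-based Tokenizer) combines an MSE diffusion reconstruction loss with a KL term and an LPIPS perceptual loss. Below is a minimal, hypothetical sketch of such a combined objective for a VAE-style tokenizer operating on individual frames in [-1, 1]; the loss weights and function names are illustrative assumptions, and the diffusion-style reconstruction term is reduced here to a plain MSE for brevity.

```python
# Hypothetical combined tokenizer objective: MSE reconstruction + KL + LPIPS.
import torch
import torch.nn.functional as F
import lpips  # pip install lpips (learned perceptual similarity metric)

perceptual = lpips.LPIPS(net="vgg")  # pretrained perceptual metric

def tokenizer_loss(x, x_hat, mu, logvar, kl_weight=1e-6, lpips_weight=0.1):
    rec = F.mse_loss(x_hat, x)                                      # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL to N(0, I)
    perc = perceptual(x_hat, x).mean()                              # LPIPS distance
    return rec + kl_weight * kl + lpips_weight * perc
```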