Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- URL: http://arxiv.org/abs/2510.02283v1
- Date: Thu, 02 Oct 2025 17:55:42 GMT
- Title: Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh,
- Abstract summary: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality.<n>Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers.<n>We propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets.
- Score: 50.945885467651216
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/
Related papers
- LIVE: Long-horizon Interactive Video World Modeling [39.52605866460851]
Long-horizon Interactive Video world modEl enforces bounded error accumulation via a novel cycle-consistency objective.<n>Live achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
arXiv Detail & Related papers (2026-02-03T17:10:03Z) - End-to-End Training for Autoregressive Video Diffusion via Self-Resampling [63.84672807009907]
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch.<n>We introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale.
arXiv Detail & Related papers (2025-12-17T18:53:29Z) - LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model [22.92353994818742]
Driving world models are used to simulate futures by video generation based on the condition of the current state and actions.<n>Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility.<n>We propose several solutions to build a simple yet effective long-term driving world model.
arXiv Detail & Related papers (2025-06-02T11:19:23Z) - VideoMerge: Towards Training-free Long Video Generation [46.108622251662176]
Long video generation remains a challenging and compelling topic in computer vision.<n>We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos.
arXiv Detail & Related papers (2025-03-13T00:47:59Z) - Time Does Tell: Self-Supervised Time-Tuning of Dense Image
Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z) - Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z) - Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z) - Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z) - Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.