LIVE: Long-horizon Interactive Video World Modeling
- URL: http://arxiv.org/abs/2602.03747v1
- Date: Tue, 03 Feb 2026 17:10:03 GMT
- Title: LIVE: Long-horizon Interactive Video World Modeling
- Authors: Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, Li Jiang
- Abstract summary: Long-horizon Interactive Video world modEl (LIVE) enforces bounded error accumulation via a novel cycle-consistency objective. LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
- Score: 39.52605866460851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
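To make the cycle-consistency idea concrete, below is a minimal sketch of one training step based only on the abstract's description (forward rollout from ground truth, reverse generation back to the start, diffusion loss on the reconstructed terminal state). The model interface, the negated-action convention for the reverse pass, and the `diffusion_loss` helper are hypothetical illustrations, not the authors' implementation.

```python
import torch  # frames/actions are assumed to be PyTorch tensors


def cycle_consistency_step(model, frames, actions, horizon, diffusion_loss):
    """frames: (B, T, C, H, W) ground-truth clip; actions: (B, T, A) action sequence."""
    x0 = frames[:, 0]  # ground-truth initial frame

    # Forward rollout: autoregressively predict `horizon` future frames,
    # each conditioned on the model's own previous output (no teacher model).
    state = x0
    for t in range(horizon):
        state = model(state, actions[:, t])

    # Reverse generation: roll back from the final predicted frame toward the
    # initial state. Conditioning on negated actions is an assumption used here
    # only to stand in for whatever reverse process the paper actually defines.
    for t in reversed(range(horizon)):
        state = model(state, -actions[:, t])

    # Diffusion loss on the reconstructed terminal state: it should match the
    # ground-truth starting frame, which explicitly bounds how much error the
    # long rollout is allowed to accumulate.
    return diffusion_loss(state, x0)
```

Under the paper's progressive training curriculum, `horizon` would presumably be increased gradually over the course of training to stabilize optimization.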
Related papers
- LoL: Longer than Longer, Scaling Video Generation to Hour [50.945885467651216]
This work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
arXiv Detail & Related papers (2026-01-23T17:21:35Z) - End-to-End Training for Autoregressive Video Diffusion via Self-Resampling [63.84672807009907]
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. We introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale.
arXiv Detail & Related papers (2025-12-17T18:53:29Z) - BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models [50.986189632485285]
We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, BAgger trains with standard score or flow matching objectives. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation.
arXiv Detail & Related papers (2025-12-12T23:02:02Z) - Self-Forcing++: Towards Minute-Scale High-Quality Video Generation [50.945885467651216]
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. We propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets.
arXiv Detail & Related papers (2025-10-02T17:55:42Z) - STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation [42.73124501421074]
STAGE is an auto-regressive framework that pioneers hierarchical feature coordination and multiphase optimization for sustainable video synthesis. HTFT enhances temporal consistency between video frames throughout the video generation process. We generated 600 frames of high-quality driving videos on the nuScenes dataset, which far exceeds the maximum length achievable by existing methods.
arXiv Detail & Related papers (2025-06-16T06:53:05Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model [22.92353994818742]
Driving world models simulate futures via video generation conditioned on the current state and actions. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. We propose several solutions to build a simple yet effective long-term driving world model.
arXiv Detail & Related papers (2025-06-02T11:19:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.