StableWorld: Towards Stable and Consistent Long Interactive Video Generation
- URL: http://arxiv.org/abs/2601.15281v1
- Date: Wed, 21 Jan 2026 18:59:02 GMT
- Title: StableWorld: Towards Stable and Consistent Long Interactive Video Generation
- Authors: Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Ziwei Liu, Chenyang Si,
- Abstract summary: We investigate the overlooked challenge of stability and temporal consistency in interactive video generation.<n>We propose a simple yet effective method, textbfStableWorld, a Dynamic Frame Eviction Mechanism.<n>StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporal consistency of interactive generation.
- Score: 45.597087309159456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we initially investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, \textbf{StableWorld}, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporal consistency of interactive generation. Promising results on multiple interactive video models, \eg, Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
Related papers
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation [45.753757870577196]
We introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning.<n>We show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses.
arXiv Detail & Related papers (2026-02-04T15:42:58Z) - LoL: Longer than Longer, Scaling Video Generation to Hour [50.945885467651216]
This work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay.<n>As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
arXiv Detail & Related papers (2026-01-23T17:21:35Z) - FlowAct-R1: Towards Interactive Humanoid Video Generation [37.04996721172613]
FlowAct-R1 is a framework specifically designed for real-time interactive humanoid video generation.<n>Our framework achieves a stable 25 fps at 480p resolution with a time-to-first-frame (F) of only around 1.5 seconds.
arXiv Detail & Related papers (2026-01-15T06:16:22Z) - TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model [53.555353366322464]
We present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system.<n>Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible synthesis systems.
arXiv Detail & Related papers (2025-12-31T18:31:46Z) - Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation [16.692450893925148]
We present a novel streaming framework named Knot Forcing for real-time portrait animation.<n>K Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences.
arXiv Detail & Related papers (2025-12-25T16:34:56Z) - LyTimeT: Towards Robust and Interpretable State-Variable Discovery [7.505092370141079]
LyTimeT is a framework for interpretable variable extraction.<n>It learns robust and stable latent representations of dynamical video systems.<n>Our results demonstrate that combiningtemporal attention with stability constraints yields predictive models.
arXiv Detail & Related papers (2025-10-22T16:03:10Z) - Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events [71.2439653098351]
Continuous space-time video super-STVSR has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary temporal scales.<n>We present EvEnhancer, a novel approach that marries unique properties of high temporal and high dynamic range encapsulated in event streams.<n>Our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining generalizability at OOD scales.
arXiv Detail & Related papers (2025-10-04T15:23:07Z) - Occlusion-Aware Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction [29.807994561746437]
We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video.<n>Our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions.<n>We validate our approach using 3D Gaussian Splatting on challenging monocular videos.
arXiv Detail & Related papers (2025-07-10T19:56:10Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.<n>Our key insight is that large video generation models can act as both neurals and implicit physics simulators'', having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - Intrinsic Temporal Regularization for High-resolution Human Video
Synthesis [59.54483950973432]
temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regulation to single-image generator, leading to a powerful " INTERnet" capable of generating $512times512$ resolution human action videos.
arXiv Detail & Related papers (2020-12-11T05:29:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.