Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory
- URL: http://arxiv.org/abs/2602.02393v2
- Date: Tue, 03 Feb 2026 15:06:38 GMT
- Title: Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory
- Authors: Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, Ming-Ming Cheng,
- Abstract summary: We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments.<n>We introduce a Pose-free Memory (HPMC) that distills historical latents into a fixed-budget geometric representation.<n>We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
- Score: 101.2076718776139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model's long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.
Related papers
- Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics [6.208369829942616]
We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm.<n>ULD unifies the efficiency of model-free methods with the representational strengths of model-based approaches.<n> evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari.
arXiv Detail & Related papers (2026-02-13T06:06:56Z) - Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics.<n>We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously.<n>We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z) - RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control.<n>We present RELIC, a unified framework that tackles these three challenges altogether.<n>Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities.<n>Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark.<n>We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation [44.75113949778924]
ARTDECO is a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines.<n>We show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization.
arXiv Detail & Related papers (2025-10-09T17:57:38Z) - SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models [42.814012901180774]
textbfSAMPO is a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation.<n>We show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control.<n>We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks.
arXiv Detail & Related papers (2025-09-19T02:41:37Z) - EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling [8.250616459360684]
We introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models.<n>Our approach outperforms existing baselines across visually challenging Atari 100k tasks, memory-demanding benchmark, and 3D first-person ViZDoom environments.
arXiv Detail & Related papers (2025-02-01T15:49:59Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - STORM: Efficient Stochastic Transformer based World Models for
Reinforcement Learning [82.03481509373037]
Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments.
We introduce Transformer-based wORld Model (STORM), an efficient world model architecture that combines strong modeling and generation capabilities.
Storm achieves a mean human performance of $126.7%$ on the Atari $100$k benchmark, setting a new record among state-of-the-art methods.
arXiv Detail & Related papers (2023-10-14T16:42:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.