TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
- URL: http://arxiv.org/abs/2601.00051v1
- Date: Wed, 31 Dec 2025 18:31:46 GMT
- Title: TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
- Authors: Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, Weichen Li, Zuoxin Li, Guangce Liu, Jialun Liu, Junqi Liu, Haoyuan Wang, Qizhen Weng, Xuan'er Wu, Xunzhi Xiang, Xiaoyan Yang, Xin Zhang, Shiwen Zhang, Junyu Zhou, Chengcheng Zhou, Haibin Huang, Chi Zhang, Xuelong Li
- Abstract summary: We present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible synthesis systems.
- Score: 53.555353366322464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation by moving from frame-level to segment-level planning, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
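The generation-reconstruction-guidance loop described above can be pictured as a simple closed cycle: generate a short segment, lift it into a persistent 4D memory, and condition the next segment on that memory. Below is a minimal, hypothetical sketch of that control flow; every name (WorldMemory, generate_segment, reconstruct) is an illustrative stand-in, not TeleWorld's actual API.

```python
# Minimal, hypothetical sketch of a generation-reconstruction-guidance loop.
# All names are illustrative stand-ins; the abstract does not describe
# TeleWorld's real interfaces at code level.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class WorldMemory:
    """Persistent 4D state: a growing record of (x, y, z, t) points."""
    points: List[Tuple[float, float, float, int]] = field(default_factory=list)

    def update(self, new_points: List[Tuple[float, float, float, int]]) -> None:
        self.points.extend(new_points)

def generate_segment(memory: WorldMemory, action: str, length: int) -> List[str]:
    """Stand-in for the autoregressive diffusion generator, conditioned on
    the 4D memory so new frames stay consistent with what already exists."""
    t0 = len(memory.points)
    return [f"frame(t={t0 + i}, action={action})" for i in range(length)]

def reconstruct(frames: List[str], t0: int) -> List[Tuple[float, float, float, int]]:
    """Stand-in for lifting generated frames into 4D spatio-temporal points."""
    return [(0.0, 0.0, 0.0, t0 + i) for i in range(len(frames))]

def closed_loop(actions: List[str], segment_len: int = 8) -> WorldMemory:
    memory = WorldMemory()
    for action in actions:
        frames = generate_segment(memory, action, segment_len)  # generate
        points = reconstruct(frames, t0=len(memory.points))     # reconstruct
        memory.update(points)                                   # guide next step
    return memory

if __name__ == "__main__":
    mem = closed_loop(["move_forward", "turn_left"])
    print(f"memory holds {len(mem.points)} 4D points")  # 16
```

The point of this structure is that errors cannot silently accumulate inside the generator alone: each segment is re-grounded in the shared 4D memory before the next one is produced.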
Related papers
- RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space [51.441415833480505]
RAYNOVA is a multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It constructs an isotropic-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding.
arXiv Detail & Related papers (2026-02-24T08:41:40Z)
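Plücker-ray positional encoding, which RAYNOVA's abstract mentions, is a standard construction: a ray with origin o and unit direction d is represented by the 6-vector (d, o × d), which does not depend on which point along the ray is chosen as the origin. A minimal NumPy sketch (the normalization details are assumptions; the paper's exact variant is not given in the abstract):

```python
import numpy as np

def plucker_embedding(origins: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Encode rays as 6D Plücker coordinates (d, o x d).

    origins, directions: (N, 3) arrays. The encoding is independent of
    which point on the ray is used as the origin, which is what makes it
    a convenient relative positional signal for multiview models.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(origins, d)
    return np.concatenate([d, moment], axis=-1)  # (N, 6)

# Two origins on the same ray produce the same embedding.
o1 = np.array([[0.0, 0.0, 0.0]])
o2 = o1 + 2.5 * np.array([[1.0, 0.0, 0.0]])   # shifted along the direction
dirs = np.array([[1.0, 0.0, 0.0]])
assert np.allclose(plucker_embedding(o1, dirs), plucker_embedding(o2, dirs))
```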
- DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling [67.95038177144554]
We introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic captions. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos.
arXiv Detail & Related papers (2025-12-02T18:24:27Z)
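The abstract lists several annotation modalities per video. As a reading aid only, a hypothetical record type for one sample might look like the following; all field names and shapes are assumptions, not DynamicVerse's actual schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DynamicVerseSample:
    """Hypothetical per-video record; field names and shapes are illustrative."""
    video_id: str
    depth_maps: np.ndarray        # (T, H, W) metric-scale depth, meters
    camera_poses: np.ndarray      # (T, 4, 4) world-from-camera extrinsics
    instance_masks: np.ndarray    # (T, K, H, W) boolean masks for K instances
    motion_tracks: np.ndarray     # (K, T, 3) per-instance 3D trajectories
    caption: str                  # holistic scene description

    def num_frames(self) -> int:
        return self.depth_maps.shape[0]
```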
- IC-World: In-Context Generation for Shared World Modeling [61.69655562995357]
Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC-World, a novel generation framework enabling parallel generation for all input images.
arXiv Detail & Related papers (2025-12-01T16:52:02Z)
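The distinguishing idea in IC-World is that all views of one shared world are generated jointly in a single parallel pass, rather than one video at a time. A toy sketch of that batching pattern follows; the mean-image "cross-view context" is a placeholder for whatever cross-view mechanism the model actually uses:

```python
import numpy as np

def generate_shared_world(images: np.ndarray, poses: np.ndarray) -> np.ndarray:
    """Toy stand-in: produce T frames per input view in one batched call.

    images: (V, H, W, 3) input views of the same underlying world
    poses:  (V, 4, 4) camera poses; a real generator would condition on
            these, but they are unused in this toy sketch
    returns (V, T, H, W, 3) videos, generated jointly so each view sees
    shared context (here simulated by mixing in the mean image).
    """
    T = 4
    shared = images.mean(axis=0, keepdims=True)             # cross-view context
    frames = 0.5 * images[:, None] + 0.5 * shared[:, None]  # (V, 1, H, W, 3)
    return np.repeat(frames, T, axis=1)

videos = generate_shared_world(np.zeros((3, 8, 8, 3)),
                               np.eye(4)[None].repeat(3, axis=0))
print(videos.shape)  # (3, 4, 8, 8, 3)
```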
- Any4D: Open-Prompt 4D Generation from Natural Language and Images [7.541641344819342]
arXiv Detail & Related papers (2025-11-24T04:17:26Z)
- SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models [42.814012901180774]
SAMPO is a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. We show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks.
arXiv Detail & Related papers (2025-09-19T02:41:37Z)
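SAMPO's hybrid structure, coarse-to-fine autoregression inside each frame plus causal conditioning on previous frames, amounts to two nested loops. The sketch below uses nearest-neighbor upsampling as a stand-in for the learned scale predictor, which is purely an assumption:

```python
import numpy as np

def predict_scale(coarse: np.ndarray, prev_frame: np.ndarray) -> np.ndarray:
    """Stand-in for a learned scale predictor: upsample the coarse estimate
    and blend in the previous frame as causal context."""
    up = coarse.repeat(2, axis=0).repeat(2, axis=1)   # 2x nearest upsample
    context = prev_frame[: up.shape[0], : up.shape[1]]
    return 0.8 * up + 0.2 * context

def sampo_rollout(first_frame: np.ndarray, num_frames: int) -> list:
    frames = [first_frame]
    H, W = first_frame.shape
    for _ in range(num_frames - 1):
        x = np.zeros((H // 4, W // 4))        # coarsest scale
        for _ in range(2):                    # intra-frame: coarse -> fine
            x = predict_scale(x, frames[-1])  # inter-frame: causal context
        frames.append(x)
    return frames

frames = sampo_rollout(np.random.rand(16, 16), num_frames=5)
print(len(frames), frames[-1].shape)  # 5 (16, 16)
```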
- OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling [86.12242953301121]
We introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. We establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments.
arXiv Detail & Related papers (2025-09-15T17:59:19Z)
- Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z)
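The PEWM idea, composing long behaviors from fixed short-horizon clips each bound to a language primitive, reduces to chaining conditioned rollouts where each clip's final state seeds the next. A toy sketch with hypothetical primitive names:

```python
from typing import Callable, Dict, List

# Hypothetical primitive-conditioned generators, each limited to a fixed
# short horizon so language and visuals stay tightly aligned.
HORIZON = 8

def make_primitive(name: str) -> Callable[[str], List[str]]:
    def rollout(state: str) -> List[str]:
        return [f"{state}->{name}[{t}]" for t in range(HORIZON)]
    return rollout

PRIMITIVES: Dict[str, Callable[[str], List[str]]] = {
    p: make_primitive(p) for p in ("reach", "grasp", "lift")
}

def execute_plan(plan: List[str], state: str = "s0") -> List[str]:
    """Chain short-horizon primitive rollouts into one long behavior."""
    frames: List[str] = []
    for primitive in plan:
        clip = PRIMITIVES[primitive](state)
        frames.extend(clip)
        state = clip[-1]          # last frame seeds the next primitive
    return frames

print(len(execute_plan(["reach", "grasp", "lift"])))  # 24 frames
```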
- Learning World Models for Interactive Video Generation [20.793871778030113]
We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors. Our work establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
arXiv Detail & Related papers (2025-05-28T05:55:44Z)
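VRAG's mechanism, as described in the abstract, is to retrieve previously generated frames by global state and feed them back as conditioning, so that revisited regions stay consistent. A toy nearest-neighbor sketch (the blending step stands in for the real generator):

```python
import numpy as np

def retrieve(memory: list, query: np.ndarray, k: int = 2) -> list:
    """Return the k stored entries whose embeddings are closest to the query."""
    return sorted(memory, key=lambda m: np.linalg.norm(m["emb"] - query))[:k]

def vrag_step(memory: list, state_emb: np.ndarray) -> np.ndarray:
    """Toy generation step: condition on retrieved frames plus global state.

    Retrieval pulls earlier frames of the same region back into context,
    which is what limits long-term compounding error.
    """
    refs = retrieve(memory, state_emb)
    context = (np.mean([r["frame"] for r in refs], axis=0)
               if refs else np.zeros_like(state_emb))
    frame = 0.7 * context + 0.3 * state_emb      # stand-in for the generator
    memory.append({"frame": frame, "emb": state_emb})
    return frame

memory: list = []
for t in range(5):
    vrag_step(memory, state_emb=np.full(4, float(t)))
print(len(memory))  # 5 frames generated with retrieval-augmented context
```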
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.