World-in-World: World Models in a Closed-Loop World
- URL: http://arxiv.org/abs/2510.18135v1
- Date: Mon, 20 Oct 2025 22:09:15 GMT
- Title: World-in-World: World Models in a Closed-Loop World
- Authors: Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen
- Abstract summary: We introduce World-in-World, the first open platform that benchmarks world models in a closed-loop world that mirrors real agent-environment interactions. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success; controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
- Score: 123.85805788728128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs to drive decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success; controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
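The closed-loop protocol the abstract describes can be pictured as a model-predictive control loop: at each step the agent samples candidate action sequences, rolls each one out inside the world model, scores the imagined futures against the task goal, and executes only the first action of the best plan before re-planning on the next real observation. The sketch below is a minimal illustration of that loop; the `WorldModel`, `Environment`, `plan_step`, and `score_rollout` interfaces are hypothetical stand-ins, not the actual World-in-World action API.

```python
# Minimal sketch of closed-loop planning with a generative world model.
# All interfaces here (WorldModel, Environment, score_rollout) are
# hypothetical stand-ins, not the actual World-in-World action API.
import random
from typing import Callable, List, Protocol, Sequence


class WorldModel(Protocol):
    def rollout(self, obs, actions: Sequence[str]) -> List[object]:
        """Imagine the observations that would follow `actions` from `obs`."""
        ...


class Environment(Protocol):
    def reset(self): ...                 # returns the initial observation
    def step(self, action: str): ...     # returns (obs, done, success)


def plan_step(wm: WorldModel, obs, action_space: List[str],
              score_rollout: Callable[[List[object]], float],
              horizon: int = 8, num_candidates: int = 16) -> str:
    """Sample candidate plans, imagine them with the WM, return the best first action.

    Raising `num_candidates` or `horizon` spends more inference-time compute,
    which is the knob that finding (3) in the abstract refers to.
    """
    best_action, best_score = action_space[0], float("-inf")
    for _ in range(num_candidates):
        plan = [random.choice(action_space) for _ in range(horizon)]
        imagined = wm.rollout(obs, plan)   # predictive perception
        score = score_rollout(imagined)    # task-goal scoring
        if score > best_score:
            best_action, best_score = plan[0], score
    return best_action


def run_episode(env: Environment, wm: WorldModel, action_space: List[str],
                score_rollout: Callable[[List[object]], float],
                max_steps: int = 100) -> bool:
    """Closed loop: plan, execute one action, observe, re-plan."""
    obs = env.reset()
    for _ in range(max_steps):
        action = plan_step(wm, obs, action_space, score_rollout)
        obs, done, success = env.step(action)
        if done:
            return success  # task success, the benchmark's primary metric
    return False
```

Because only the first action of each imagined plan is executed, prediction errors are corrected by fresh observations at every step; this is what distinguishes closed-loop evaluation from open-loop protocols that score generated video in isolation.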
Related papers
- WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models [114.95269118652163]
We introduce WorldArena, a unified benchmark designed to evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability.
arXiv Detail & Related papers (2026-02-09T18:09:20Z) - A Comprehensive Survey on World Models for Embodied AI [14.457261562275121]
Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. This survey presents a unified framework for world models in embodied AI.
arXiv Detail & Related papers (2025-10-19T07:12:32Z) - Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z) - FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making [32.050134958163184]
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. We propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. (An illustrative sketch of such a mapping appears after this list.)
arXiv Detail & Related papers (2025-07-15T21:49:49Z) - Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o, and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z) - WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning [52.36434784963598]
We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. We show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans are able to solve both tasks perfectly.
arXiv Detail & Related papers (2025-06-04T18:22:40Z) - Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach [82.27842884709378]
We propose a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment. Our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate.
arXiv Detail & Related papers (2025-05-22T09:08:47Z)
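For the FOUNDER entry above, the grounding idea can be sketched as a small learned projection from a foundation model's observation embedding into the world model's latent state space, trained so the projected embedding matches the state the WM's own encoder assigns to the same observation. Everything below (module names, dimensions, the MSE objective) is an assumed illustration for intuition, not FOUNDER's actual architecture.

```python
# Assumed illustration of grounding FM representations in a WM state space:
# a learned mapper f: R^{d_fm} -> R^{d_wm} trained to match the WM encoder's
# state for the same observation. Names, shapes, and the loss are hypothetical.
import torch
import torch.nn as nn


class StateMapper(nn.Module):
    """Maps a frozen foundation-model embedding to a world-model latent state."""

    def __init__(self, d_fm: int = 1024, d_wm: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_fm, 512), nn.GELU(), nn.Linear(512, d_wm)
        )

    def forward(self, fm_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(fm_embedding)


def grounding_loss(mapper: StateMapper,
                   fm_embedding: torch.Tensor,
                   wm_state: torch.Tensor) -> torch.Tensor:
    """Pull the mapped FM embedding toward the WM's own encoding of the
    same observation, so concepts expressed in FM space correspond to
    reachable states in the world simulator."""
    return nn.functional.mse_loss(mapper(fm_embedding), wm_state)


# Example: one optimisation step on a random batch (shapes are illustrative).
mapper = StateMapper()
opt = torch.optim.Adam(mapper.parameters(), lr=1e-4)
fm_emb = torch.randn(32, 1024)   # frozen FM features for a batch of frames
wm_z = torch.randn(32, 256)      # WM encoder states for the same frames
opt.zero_grad()
loss = grounding_loss(mapper, fm_emb, wm_z)
loss.backward()
opt.step()
```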