WoW: Towards a World omniscient World model Through Embodied Interaction
- URL: http://arxiv.org/abs/2509.22642v1
- Date: Fri, 26 Sep 2025 17:59:07 GMT
- Title: WoW: Towards a World omniscient World model Through Embodied Interaction
- Authors: Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang,
- Abstract summary: Authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world.<n>We present WoW, a generative world model trained on 2 million robot interaction trajectories.<n>We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video.
- Score: 83.43543124512719
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
Related papers
- Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models [57.71440995598757]
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models.<n>Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world.
arXiv Detail & Related papers (2025-12-15T18:03:42Z) - PhysWorld: From Real Videos to World Models of Deformable Objects via Physics-Aware Demonstration Synthesis [52.905353023326306]
We propose PhysWorld, a framework that synthesizes physically plausible and diverse demonstrations to learn efficient world models.<n>Experiments show that PhysWorld has competitive performance while enabling inference speeds 47 times faster than the recent state-of-the-art method, i.e., PhysTwin.
arXiv Detail & Related papers (2025-10-24T13:25:39Z) - PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System [67.2851799763138]
PhysHSI comprises a simulation training pipeline and a real-world deployment system.<n>In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data.<n>For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs.
arXiv Detail & Related papers (2025-10-13T07:11:37Z) - SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning [6.255814224573073]
SimGenHOI is a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI.<n>Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose.<n>To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding.
arXiv Detail & Related papers (2025-08-18T15:20:46Z) - Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis [51.95817740348585]
Human-X is a novel framework designed to enable immersive and physically plausible human interactions across diverse entities.<n>Our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner.<n>Our framework is validated in real-world applications, including virtual reality interface for human-robot interaction.
arXiv Detail & Related papers (2025-08-04T06:35:48Z) - SlotPi: Physics-informed Object-centric Reasoning Models [37.32107835829927]
We introduce SlotPi, a physics-informed object-centric reasoning model.<n>Our experiments highlight the model's strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets.<n>We have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model's capabilities.
arXiv Detail & Related papers (2025-06-12T14:53:36Z) - Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [77.6397528430433]
We present Physical AI models that can understand the physical world and generate appropriate embodied decisions.<n>To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics.<n>For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments.
arXiv Detail & Related papers (2025-03-18T22:06:58Z) - Generative Physical AI in Vision: A Survey [78.07014292304373]
Gene Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication.<n>This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D/4D content.<n>As generative models evolve to increasingly integrate physical realism and dynamic simulation, their potential to function as "world simulators" expands.
arXiv Detail & Related papers (2025-01-19T03:19:47Z) - Visual cognition in multimodal large language models [12.603212933816206]
Recent advancements have rekindled interest in the potential to emulate human-like cognitive abilities.
This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
arXiv Detail & Related papers (2023-11-27T18:58:34Z) - Causal World Models by Unsupervised Deconfounding of Physical Dynamics [20.447000858907646]
The capability of imagining internally with a mental model of the world is vitally important for human cognition.
We propose Causal World Models (CWMs) that allow unsupervised modeling of relationships between the intervened and alternative futures.
We show reductions in complexity sample for reinforcement learning tasks and improvements in counterfactual physical reasoning.
arXiv Detail & Related papers (2020-12-28T13:44:36Z) - Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.