IPR-1: Interactive Physical Reasoner
- URL: http://arxiv.org/abs/2511.15407v1
- Date: Wed, 19 Nov 2025 13:04:44 GMT
- Title: IPR-1: Interactive Physical Reasoner
- Authors: Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li
- Abstract summary: We ask whether an agent can acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms.
- Score: 12.534108491269954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, and Utility, ranging from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyzing physics and causality. We therefore propose IPR (Interactive Physical Reasoner), which uses world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code that aligns semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, IPR performs robustly at all three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also transfers zero-shot to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
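The abstract's core loop, a world model scoring candidate actions proposed by a VLM policy, can be sketched as follows. This is a minimal illustrative sketch with toy stand-ins: every function name, and the scoring rule itself, is an assumption for illustration, not the paper's implementation.

```python
def vlm_propose_actions(observation, k=4):
    """Stand-in for a VLM policy proposing k candidate PhysCode actions.

    In the real system these would be physics-centric action codes emitted
    by the VLM; here they are placeholder strings.
    """
    return [f"physcode_{i}" for i in range(k)]


def world_model_rollout(observation, action):
    """Stand-in world model: return a scalar score for an imagined rollout.

    A real world model would roll the action forward and accumulate reward;
    this toy version derives a deterministic pseudo-score from the action
    string so the example is runnable.
    """
    return sum(ord(c) for c in action) % 100 / 10.0


def select_and_score(observation):
    """Score each VLM candidate with a world-model rollout; keep the best.

    The best-scoring action would then serve as a reinforcement signal
    for the VLM policy.
    """
    candidates = vlm_propose_actions(observation)
    scored = [(world_model_rollout(observation, a), a) for a in candidates]
    best_score, best_action = max(scored)
    return best_action, best_score


action, score = select_and_score("frame_0")
```

The key design point the sketch reflects is that proposal (semantic reasoning by the VLM) and evaluation (look-ahead by the world model) share one action space, which is the role PhysCode plays in the paper.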
Related papers
- Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos [82.4003989236851]
We propose a novel paradigm that leverages glitches in gameplay videos, i.e., visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. Experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5VL by 2.5%, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark.
arXiv Detail & Related papers (2026-01-23T06:02:07Z) - EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding [56.89359230139883]
We introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning and Intent-Driven Reasoning. We present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (the Escher series). It is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes.
arXiv Detail & Related papers (2026-01-04T14:42:39Z) - WoW: Towards a World omniscient World model Through Embodied Interaction [83.43543124512719]
A world model's authentic physical intuition must be grounded in extensive, causally rich interactions with the real world. We present WoW, a generative world model trained on 2 million robot interaction trajectories. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video.
arXiv Detail & Related papers (2025-09-26T17:59:07Z) - Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [77.6397528430433]
We present Physical AI models that can understand the physical world and generate appropriate embodied decisions. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments.
arXiv Detail & Related papers (2025-03-18T22:06:58Z) - PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction [22.48933099236595]
We present PhysHOI, the first physics-based whole-body HOI imitation approach without task-specific reward designs.
In addition to kinematic HOI representations of humans and objects, we introduce a contact graph to model the contact relations between body parts and objects explicitly.
Based on these key designs, PhysHOI can imitate diverse HOI tasks simply yet effectively without prior knowledge.
arXiv Detail & Related papers (2023-12-07T16:06:31Z) - Measuring and Modeling Physical Intrinsic Motivation [4.995872423496944]
Humans are interactive agents driven to seek out situations with interesting physical dynamics.
We first collect ratings of how interesting humans find a variety of physics scenarios.
We then model human interestingness responses by implementing various hypotheses of intrinsic motivation.
arXiv Detail & Related papers (2023-05-22T19:52:08Z) - Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z) - SPACE: A Simulator for Physical Interactions and Causal Learning in 3D Environments [2.105564340986074]
We introduce SPACE: A Simulator for Physical Interactions and Causal Learning in 3D Environments.
Inspired by daily object interactions, the SPACE dataset comprises videos depicting three types of physical events: containment, stability and contact.
We show that the SPACE dataset improves the learning of intuitive physics with an approach inspired by curriculum learning.
arXiv Detail & Related papers (2021-08-13T11:49:46Z) - Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects [79.351446087227]
We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects.
Specifically, we use a simulator to predict effects and enforce that estimated forces must lead to the same effect as depicted in the video.
arXiv Detail & Related papers (2020-03-26T17:20:23Z)
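The simulate-and-match idea in the last entry above (estimated forces must reproduce the effect seen in the video) can be illustrated with a toy point-mass simulator. All names and the brute-force candidate search are illustrative assumptions, not the paper's method, which uses a full physics simulator and learned perception.

```python
def simulate_displacement(force, mass=1.0, dt=0.1, steps=10):
    """Toy point-mass simulator: integrate a constant force with
    semi-implicit Euler steps and return the resulting displacement."""
    v, x = 0.0, 0.0
    for _ in range(steps):
        v += (force / mass) * dt  # update velocity from acceleration
        x += v * dt               # update position from velocity
    return x


def infer_force(observed_x, candidates):
    """Pick the candidate force whose simulated effect best matches
    the displacement observed in the video."""
    return min(candidates,
               key=lambda f: abs(simulate_displacement(f) - observed_x))


# Pretend this displacement was measured from a video frame pair.
target = simulate_displacement(2.0)
best = infer_force(target, [0.5, 1.0, 2.0, 4.0])
```

The design choice worth noting is that supervision comes from effects, not from force labels: any force estimate is acceptable only if the simulator maps it to the observed outcome.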
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information provided and is not responsible for any consequences of its use.