VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
- URL: http://arxiv.org/abs/2602.13294v1
- Date: Mon, 09 Feb 2026 05:46:44 GMT
- Title: VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
- Authors: Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen
- Abstract summary: VisPhyWorld is an execution-based framework that evaluates physical reasoning. Because models must produce runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. We show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.
- Score: 48.60465268759689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. Because the model must produce runnable code, the inferred world representation is directly inspectable, editable, and falsifiable, which separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos for 97.7% of the benchmark scenes. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.
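To make the evaluation protocol concrete, here is a minimal sketch of what an execution-based loop of this kind could look like. This is not the paper's actual pipeline or API: `query_mllm`, `score_appearance`, and `score_dynamics` are hypothetical callables supplied by the caller, and the prompt and output conventions are assumptions for illustration.

```python
# A minimal sketch of an execution-based evaluation loop in the spirit of
# VisPhyWorld. Everything here is hypothetical: the MLLM query and the two
# scoring callables are injected placeholders, not the paper's actual API.
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

def evaluate_scene(
    frames_dir: Path,
    reference_video: Path,
    query_mllm: Callable[[str, list[Path]], str],      # returns simulator code
    score_appearance: Callable[[Path, Path], float],   # rendered vs. reference
    score_dynamics: Callable[[Path, Path], float],
) -> dict:
    """Ask an MLLM for runnable simulator code, execute it, and score it."""
    # 1. Force the model to commit to a testable hypothesis: runnable code.
    prompt = (
        "Watch these frames and write a self-contained Python physics "
        "simulation that reproduces the scene and renders it to out.mp4."
    )
    code = query_mllm(prompt, sorted(frames_dir.glob("*.png")))

    # 2. Execute the generated code in isolation; crashes or missing output
    #    count as invalid (the paper reports a 97.7% valid-video rate).
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "sim.py"
        script.write_text(code)
        result = subprocess.run(
            ["python", str(script)], cwd=workdir,
            capture_output=True, timeout=120,
        )
        rendered = Path(workdir) / "out.mp4"
        if result.returncode != 0 or not rendered.exists():
            return {"valid": False}

        # 3. Score appearance and motion separately, so physical reasoning
        #    is judged apart from rendering quality.
        return {
            "valid": True,
            "appearance": score_appearance(rendered, reference_video),
            "dynamics": score_dynamics(rendered, reference_video),
        }
```

Because the inferred world state lives in the generated script, a failure can be inspected and edited directly, which is what makes the hypothesis falsifiable rather than just a classification output.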
Related papers
- Perceptual Self-Reflection in Agentic Physics Simulation Code Generation [0.0]
We present a framework for generating physics simulation code from natural language descriptions. The key innovation is perceptual validation, which analyzes rendered animation frames using a vision-capable language model. We evaluate the system across seven domains, including classical mechanics, fluid dynamics, thermodynamics, electromagnetics, wave physics, reaction-diffusion systems, and non-physics data visualization. (A hypothetical sketch of such a validation loop follows this entry.)
arXiv Detail & Related papers (2026-02-12T15:48:33Z)
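The perceptual-validation idea above can be sketched as a simple generate-render-critique loop. All function names below are hypothetical placeholders, not the paper's implementation:

```python
# A hypothetical sketch of perceptual self-reflection: execute generated
# simulation code, render frames, and let a vision-capable model critique
# the result before retrying. All callables are injected placeholders.
from pathlib import Path
from typing import Callable

def generate_with_perceptual_validation(
    task: str,
    generate_code: Callable[[str], str],                 # code LLM
    run_and_render: Callable[[str], list[Path]],         # returns frame paths
    critique_frames: Callable[[str, list[Path]], str],   # vision LM, "" if OK
    max_rounds: int = 3,
) -> str:
    """Iteratively refine simulation code until the rendered frames pass."""
    code = generate_code(task)
    for _ in range(max_rounds):
        frames = run_and_render(code)
        feedback = critique_frames(task, frames)
        if not feedback:  # an empty critique means the render looks right
            break
        # Fold the visual critique back into the next generation attempt.
        code = generate_code(f"{task}\nFix these issues: {feedback}")
    return code
```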
- SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios [71.65387146697319]
Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios. We build an automatic pipeline to collect data, with human verification to ensure quality.
arXiv Detail & Related papers (2026-02-11T13:26:02Z)
- ProPhy: Progressive Physical Alignment for Dynamic World Simulation [55.456455952212416]
ProPhy is a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. We show that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
arXiv Detail & Related papers (2025-12-05T09:39:26Z)
- PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding [50.454084539837005]
PhysChoreo is a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. (A hypothetical sketch of this two-stage structure follows this entry.)
arXiv Detail & Related papers (2025-11-25T17:59:04Z)
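The PhysChoreo summary above describes a two-stage design; a hypothetical sketch of that structure, with both stages injected as placeholder callables, might look like this:

```python
# A hypothetical two-stage sketch matching the PhysChoreo summary: first
# estimate per-part physical properties from the image, then drive an
# editable simulation to synthesize the video. Both stages are placeholders.
from pathlib import Path
from typing import Callable

def generate_physics_video(
    image: Path,
    instruction: str,
    reconstruct_properties: Callable[[Path], dict],    # stage 1: per-part mass, stiffness, ...
    simulate_and_render: Callable[[dict, str], Path],  # stage 2: returns video path
) -> Path:
    """Stage 1 grounds physical parameters; stage 2 simulates and renders."""
    # Stage 1: part-aware physical property reconstruction.
    properties = reconstruct_properties(image)
    # A user could edit `properties` here (e.g. soften a material) before
    # simulation; the summary calls the simulation "physically editable".
    # Stage 2: temporally instructed simulation produces the final video.
    return simulate_and_render(properties, instruction)
```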
- TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility [70.24211591214528]
Video generative models produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing. Existing Video-Language Models (VLMs) struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. We introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding. We propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding.
arXiv Detail & Related papers (2025-10-08T21:03:46Z)
- Think Before You Diffuse: Infusing Physical Rules into Video Diffusion [55.046699347579455]
The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. We propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models [11.282655911647483]
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). We introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. (A hypothetical sketch of this perception/reasoning split follows this entry.)
arXiv Detail & Related papers (2024-12-11T18:40:16Z)
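The PCB summary above describes splitting perception from reasoning; a hypothetical sketch of that split, with both models as placeholder callables, could look like this:

```python
# Hypothetical sketch of a Physics Context Builder style pipeline: a small
# perception model turns the image into a structured scene description, and
# a separate language model reasons over that text. Both model calls are
# placeholders injected by the caller; nothing here is the paper's real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SceneDescription:
    objects: list[str]      # e.g. ["red ball (radius 0.1 m)", "wooden ramp (30 deg)"]
    relations: list[str]    # e.g. ["ball rests at top of ramp"]
    estimated_params: dict  # e.g. {"ball_mass_kg": 0.5, "ramp_friction": 0.3}

def answer_physics_question(
    image_path: str,
    question: str,
    perceive: Callable[[str], SceneDescription],  # fine-tuned small VLM
    reason: Callable[[str], str],                 # general-purpose LLM
) -> str:
    """Separate perception from reasoning, so each stage can be probed alone."""
    scene = perceive(image_path)
    context = (
        f"Objects: {scene.objects}\n"
        f"Relations: {scene.relations}\n"
        f"Estimated parameters: {scene.estimated_params}\n"
        f"Question: {question}"
    )
    return reason(context)
```

Because the intermediate scene description is explicit text, errors can be attributed to either the perception stage or the reasoning stage, which is the analytical benefit the summary highlights.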