SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios
- URL: http://arxiv.org/abs/2602.10840v1
- Date: Wed, 11 Feb 2026 13:26:02 GMT
- Title: SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios
- Authors: Yanan Wang, Renxi Wang, Yongxin Wang, Xuezhi Liang, Fajri Koto, Timothy Baldwin, Xiaodan Liang, Haonan Li
- Abstract summary: Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios. We build an automatic pipeline to collect data, with human verification to ensure quality.
- Score: 71.65387146697319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios across five physics domains and 52 physical concepts. We build an automatic pipeline to collect data, with human verification to ensure quality. The final dataset contains 7,659 physical scenarios with 334 human-verified examples as the test set. We evaluated 10 contemporary LLMs and found that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train textual models. Experiments show that training with our data improves physical simulation via code while substantially enhancing general code generation performance.
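The abstract describes a reinforcement learning pipeline in which generated simulation code is executed and a vision-language model judges the rendered result to produce a training reward. The sketch below illustrates that reward structure only; it is not the authors' code, and `run_simulation` and `vlm_judge` are stand-in stubs for the actual renderer and VLM judge.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# a visual-reward function that gates on successful execution, then
# scores the rendered output against the scenario description.

def run_simulation(code: str) -> list[str]:
    """Stub renderer: pretend each non-empty line of generated code
    yields one rendered frame. A real pipeline would execute the code
    in a physics engine and capture frames."""
    return [line for line in code.splitlines() if line.strip()]

def vlm_judge(frames: list[str], scenario: str) -> float:
    """Stub judge returning a score in [0, 1]. In the paper's setup, a
    vision-language model plays this role, assessing whether the frames
    depict the described physical scenario."""
    if not frames:
        return 0.0
    hits = sum(1 for f in frames if any(w in f for w in scenario.split()))
    return hits / len(frames)

def visual_reward(generated_code: str, scenario: str) -> float:
    """Reward = judge score if the code runs, else 0 (execution gate)."""
    try:
        frames = run_simulation(generated_code)
    except Exception:
        return 0.0
    return vlm_judge(frames, scenario)
```

The execution gate mirrors the pass/fail nature of the reported pass rate: code that does not run earns zero reward, while runnable code is scored on visual fidelity by the judge.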
Related papers
- VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction [48.60465268759689]
VisPhyWorld is an execution-based framework that evaluates physical reasoning. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. We show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.
arXiv Detail & Related papers (2026-02-09T05:46:44Z) - RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data [25.53943767088309]
We introduce RealPDEBench, the first benchmark for scientific Machine Learning (ML) that integrates real-world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, eight metrics, and ten baselines. Experiments reveal significant discrepancies between simulated and real-world data, while showing that pretraining with simulated data consistently improves both accuracy and convergence.
arXiv Detail & Related papers (2026-01-05T06:49:13Z) - FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs [2.3052479658146323]
We introduce FEM-Bench, a benchmark to evaluate the ability of LLMs to generate correct finite element method (FEM) and related code. These tasks capture essential numerical and physical modeling challenges while representing only a small fraction of the complexity present in the discipline. The best performing model at function writing, Gemini 3 Pro, completed 30/33 tasks at least once and 26/33 tasks all five times.
arXiv Detail & Related papers (2025-12-23T19:40:51Z) - SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models [60.80050275581661]
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities, yet they lack a grounded understanding of physical dynamics. We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks.
arXiv Detail & Related papers (2025-12-05T18:51:03Z) - SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors [58.87134689752605]
We introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. We show that even the best LLMs today have limited simulation ability (score: 40.80/100), and that performance scales log-linearly with model size. We demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning.
arXiv Detail & Related papers (2025-10-20T13:14:38Z) - PhysiX: A Foundation Model for Physics Simulations [27.359872113159405]
We introduce PhysiX, the first large-scale foundation model for physics simulation. We show that PhysiX effectively addresses the data bottleneck, outperforming task-specific baselines. Our results indicate that knowledge learned from natural videos can be successfully transferred to physics simulation.
arXiv Detail & Related papers (2025-06-21T18:10:12Z) - Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models [11.282655911647483]
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). We introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding.
arXiv Detail & Related papers (2024-12-11T18:40:16Z) - MBDS: A Multi-Body Dynamics Simulation Dataset for Graph Networks Simulators [4.5353840616537555]
Graph Network Simulators (GNS) have emerged as the leading method for modeling physical phenomena.
We have constructed a high-quality physical simulation dataset encompassing 1D, 2D, and 3D scenes.
A key feature of our dataset is the inclusion of precise multi-body dynamics, facilitating a more realistic simulation of the physical world.
arXiv Detail & Related papers (2024-10-04T03:03:06Z) - Hindsight States: Blending Sim and Real Task Elements for Efficient Reinforcement Learning [61.3506230781327]
In robotics, one approach to generate training data builds on simulations based on dynamics models derived from first principles.
Here, we leverage the imbalance in complexity of the dynamics to learn more sample-efficiently.
We validate our method on several challenging simulated tasks and demonstrate that it improves learning both alone and when combined with an existing hindsight algorithm.
arXiv Detail & Related papers (2023-03-03T21:55:04Z) - Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data [74.66568380558172]
We study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks.
We introduce Task2Sim, a unified model mapping downstream task representations to optimal simulation parameters.
It learns this mapping by training to find the set of best parameters on a set of "seen" tasks.
Once trained, it can then be used to predict best simulation parameters for novel "unseen" tasks in one shot.
arXiv Detail & Related papers (2021-11-30T19:25:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.