PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
- URL: http://arxiv.org/abs/2507.15550v2
- Date: Sun, 26 Oct 2025 07:14:27 GMT
- Title: PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
- Authors: Yimeng Chen, Piotr Piȩkos, Mateusz Ostaszewski, Firas Laakom, Jürgen Schmidhuber
- Abstract summary: PhysGym is a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent.
- Score: 29.988641224102164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including problem complexity and prior knowledge level. The benchmark comprises a suite of interactive simulations in which agents must actively probe environments, gather data sequentially under constraints, and formulate hypotheses about the underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.
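The interaction protocol the abstract describes (probe the environment, observe, hypothesize, all under a query budget) can be made concrete with a minimal sketch. Everything below is a hypothetical illustration, assuming a pendulum task with hidden law T = 2π·sqrt(L/g); the class names, method names, and budget mechanics are invented for exposition and are not the actual PhysGym API.

```python
import math

class PendulumEnv:
    """Toy interactive environment hiding the law T = 2*pi*sqrt(L/g).

    Hypothetical stand-in for a PhysGym task; the real API may differ.
    """
    def __init__(self, g=9.81, budget=10):
        self.g, self.budget = g, budget

    def probe(self, length):
        """One experiment: choose a pendulum length, observe the period."""
        if self.budget <= 0:
            raise RuntimeError("query budget exhausted")
        self.budget -= 1
        return 2 * math.pi * math.sqrt(length / self.g)

def run_agent(env, lengths=(0.25, 1.0, 4.0)):
    """Sequentially gather (L, T) pairs, then hypothesize T = c * L^k."""
    data = [(L, env.probe(L)) for L in lengths]
    (l0, t0), (l1, t1) = data[0], data[-1]
    k = math.log(t1 / t0) / math.log(l1 / l0)  # slope in log-log space
    c = t0 / l0 ** k
    return f"T = {c:.3f} * L^{k:.2f}"          # hypothesis to be scored

print(run_agent(PendulumEnv()))  # -> T = 2.006 * L^0.50
```

In the benchmark itself the hypothesis would come from an LLM agent rather than an analytic fit; the sketch only shows the probe-observe-hypothesize loop and the query budget that the abstract describes.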
Related papers
- GRACE: an Agentic AI for Particle Physics Experiment Design and Simulation [0.0]
GRACE is a simulation-native agent for autonomous experimental design in high-energy and nuclear physics. It autonomously explores design modifications using first-principles Monte Carlo methods. It evaluates candidate designs through repeated simulation, physics-motivated utility functions, and budget-aware escalation.
arXiv Detail & Related papers (2026-01-31T01:12:55Z)
- Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration [63.61423859450929]
This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses. We identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery.
arXiv Detail & Related papers (2026-01-20T18:46:42Z)
- SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z)
- HeurekaBench: A Benchmarking Framework for AI Co-scientist [2.206319727896241]
HeurekaBench is a framework for creating benchmarks with exploratory, open-ended research questions for experimental datasets. We instantiate the framework in single-cell biology to obtain the sc-HeurekaBench benchmark and use it to compare state-of-the-art single-cell agents. We find that the addition of a critic module can improve ill-formed responses from open-source LLM-based agents by up to 22% and close the gap with their closed-source counterparts.
arXiv Detail & Related papers (2026-01-04T22:16:42Z)
- AInsteinBench: Benchmarking Coding Agents on Scientific Repositories [33.48206557020983]
AInsteinBench is a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents. AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
arXiv Detail & Related papers (2025-12-24T08:11:11Z)
- From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z)
- PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation [7.0748516420242495]
PRiSM is a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent. We propose five targeted evaluation tasks covering perturbation, symbolic program synthesis, robustness, reasoning correction, and ambiguity resolution.
arXiv Detail & Related papers (2025-12-05T18:14:55Z)
- SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
- Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and tools. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z)
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [136.27567671480156]
We introduce experiment-guided ranking, which prioritizes hypotheses based on feedback from prior tests. We frame experiment-guided ranking as a sequential decision-making problem. Our approach significantly outperforms pre-experiment baselines and strong ablations.
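The summary's framing of experiment-guided ranking as sequential decision-making suggests a simple loop: run an experiment, observe feedback, and re-score the remaining hypotheses. The toy sketch below (hypothetical feature vectors and a reward-weighted similarity score, not MOOSE-Chem3's actual method) illustrates that re-ranking step.

```python
def rerank(hypotheses, feedback):
    """Reorder hypotheses by similarity to outcomes of experiments run so far.

    hypotheses: dict of name -> feature vector
    feedback:   list of (feature vector, observed reward) from prior tests
    """
    def score(feat):
        # reward-weighted dot-product similarity to tested hypotheses
        return sum(r * sum(a * b for a, b in zip(feat, f)) for f, r in feedback)
    return sorted(hypotheses, key=lambda h: score(hypotheses[h]), reverse=True)

hyps = {"h1": [1.0, 0.0], "h2": [0.0, 1.0], "h3": [0.7, 0.7]}
trials = [([1.0, 0.1], 1.0), ([0.0, 1.0], -0.5)]  # simulated experimental feedback
print(rerank(hyps, trials))  # -> ['h1', 'h3', 'h2']
```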
arXiv Detail & Related papers (2025-05-23T13:24:50Z)
- APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight [3.5385022178794805]
APEX (Anticipatory Physics-Enhanced Execution) is a framework that equips Large Language Models with physics-driven foresight for real-time task planning. APEX significantly outperforms standard LLMs and VLM-based models.
arXiv Detail & Related papers (2025-05-20T04:34:58Z)
- Benchmarking LLMs' Swarm intelligence [50.544186914115045]
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) remains largely unexplored. We introduce SwarmBench, a novel benchmark designed to systematically evaluate LLMs acting as decentralized agents on coordination tasks. We propose metrics for coordination effectiveness and analyze emergent group dynamics.
arXiv Detail & Related papers (2025-05-07T12:32:01Z)
- Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning [0.15469999759898032]
PLAID is a framework for representing and sharing datasets of physics simulations. PLAID defines a unified standard for describing simulation data. We release six datasets under the PLAID standard, covering structural mechanics and computational fluid dynamics.
arXiv Detail & Related papers (2025-05-05T18:59:17Z)
- Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs [23.608962459019278]
We introduce a novel benchmark to evaluate Large Language Models (LLMs) for scientific discovery in both the natural and social sciences. Our benchmark is based on the principles of causal graph discovery: it challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as problem complexity increases.
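Auto-Bench's causal-graph-discovery framing can be illustrated with a small intervention loop: the agent forces one variable at a time, records which others respond, and keeps only edges that no intermediate variable already explains. The graph, do-operator, and edge-recovery rule below are a simplified toy (it assumes no variable is reached both directly and indirectly), not the benchmark's actual protocol.

```python
# Hidden ground-truth causal graph over three variables: A -> B -> C.
EDGES = {("A", "B"), ("B", "C")}
VARS = ["A", "B", "C"]

def intervene(var):
    """do(var): return the set of variables that respond to forcing `var`."""
    affected, frontier = set(), [var]
    while frontier:
        node = frontier.pop()
        for u, v in EDGES:
            if u == node and v not in affected:
                affected.add(v)
                frontier.append(v)
    return affected

def discover():
    """Keep edge (u, v) only if no intermediate w already explains v."""
    guessed = set()
    for u in VARS:
        downstream = intervene(u)
        for v in downstream:
            explained = any(w != v and v in intervene(w) for w in downstream)
            if not explained:
                guessed.add((u, v))
    return guessed

print(discover() == EDGES)  # -> True
```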
arXiv Detail & Related papers (2025-02-21T05:35:20Z)
- MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLMs) have shown strong capabilities in general visual reasoning tasks. We develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS), based on an MLLM. MAPS decomposes an expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z)
- Using Machine Learning to Discover Parsimonious and Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff Dynamics [1.1510009152620668]
An underexplored aspect of machine learning is how to develop minimally-optimal representations that can facilitate better insight regarding system functioning. Our own view is that ML-based modeling should be based on computational units that are fundamentally easy to interpret in a physical-conceptual sense. We show, in the context of lumped modeling, that physical interpretability and predictive performance can both be achieved using a relatively parsimonious distributed-state multiple-flow-path network.
arXiv Detail & Related papers (2024-12-06T08:30:01Z)
- LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models [35.01842161084472]
We propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact. We present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance.
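The LLMPhy summary describes a propose-simulate-feedback loop: the LLM synthesizes candidate physics parameters (or programs), a physics engine scores them, and the error is fed back into the next proposal. A minimal sketch of that loop, with a random-search stand-in for the LLM and a quadratic-error stand-in for the simulator (both hypothetical; LLMPhy uses an actual LLM and engine):

```python
import random

def simulate(friction, true_friction=0.42):
    """Stand-in physics engine: trajectory error for a guessed parameter."""
    return (friction - true_friction) ** 2

def propose(history):
    """Stand-in for the LLM: suggest a parameter given past (guess, error) pairs."""
    if not history:
        return random.uniform(0.0, 1.0)
    best, _ = min(history, key=lambda h: h[1])
    return min(1.0, max(0.0, best + random.gauss(0.0, 0.1)))

def optimization_loop(iterations=30):
    """Zero-shot black-box optimization: propose -> simulate -> feed back."""
    history = []
    for _ in range(iterations):
        guess = propose(history)
        history.append((guess, simulate(guess)))
    return min(history, key=lambda h: h[1])

random.seed(0)
print(optimization_loop())  # best (guess, error) pair, converging near 0.42
```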
arXiv Detail & Related papers (2024-11-12T18:56:58Z)
- LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
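The bilevel structure named in the title can be written down explicitly. In our hedged reading of the abstract (this is not necessarily the paper's notation), the outer level is the LLM's discrete search over hypotheses h, such as candidate law structures or molecule designs, and the inner level is the simulation's continuous fit of parameters θ:

```latex
\min_{h \in \mathcal{H}} \; L_{\mathrm{outer}}\bigl(h, \theta^{*}(h)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(h) = \operatorname*{arg\,min}_{\theta} \; L_{\mathrm{sim}}(h, \theta)
```

The LLM proposes h using the inner level's fitted losses as feedback, which is how the abstract's "knowledge-driven, abstract reasoning" and "computational strength of simulations" are coupled.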
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
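The "entropy-augmented RL method tailored for optimizing LLMs at the token level" corresponds to treating each generated token as an action in a soft-RL objective. A generic form of such an objective, which is our reconstruction from the summary rather than ETPO's exact formulation, with the vocabulary as the action space and α weighting the entropy bonus:

```latex
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \Bigl( r(s_t, a_t) + \alpha \, \mathcal{H}\bigl(\pi_\theta(\cdot \mid s_t)\bigr) \Bigr)\right],
\qquad a_t \in \mathcal{V} \ \ \text{(one token per step)}
```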
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- Physics Inspired Hybrid Attention for SAR Target Recognition [61.01086031364307]
We propose a physics-inspired hybrid attention (PIHA) mechanism and the once-for-all (OFA) evaluation protocol to address these issues.
PIHA leverages the high-level semantics of physical information to activate and guide feature groups that are aware of the target's local semantics.
Our method outperforms other state-of-the-art approaches in 12 test scenarios with the same ASC parameters.
arXiv Detail & Related papers (2023-09-27T14:39:41Z)
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.