PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
- URL: http://arxiv.org/abs/2507.15550v2
- Date: Sun, 26 Oct 2025 07:14:27 GMT
- Title: PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
- Authors: Yimeng Chen, Piotr Piȩkos, Mateusz Ostaszewski, Firas Laakom, Jürgen Schmidhuber
- Abstract summary: PhysGym is a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent.
- Score: 29.988641224102164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including problem complexity and prior knowledge level. The benchmark comprises a suite of interactive simulations in which agents must actively probe environments, gather data sequentially under constraints, and formulate hypotheses about the underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.
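The interaction protocol the abstract describes (probe the environment, observe, hypothesize, all under a query budget) can be made concrete with a minimal sketch. Everything below is a hypothetical illustration, assuming a pendulum task with hidden law T = 2π·sqrt(L/g); the class names, method names, and budget mechanics are invented for exposition and are not the actual PhysGym API.

```python
import math

class PendulumEnv:
    """Toy interactive environment hiding the law T = 2*pi*sqrt(L/g).

    Hypothetical stand-in for a PhysGym task; the real API may differ.
    """
    def __init__(self, g=9.81, budget=10):
        self.g, self.budget = g, budget

    def probe(self, length):
        """One experiment: choose a pendulum length, observe the period."""
        if self.budget <= 0:
            raise RuntimeError("query budget exhausted")
        self.budget -= 1
        return 2 * math.pi * math.sqrt(length / self.g)

def run_agent(env, lengths=(0.25, 1.0, 4.0)):
    """Sequentially gather (L, T) pairs, then hypothesize T = c * L^k."""
    data = [(L, env.probe(L)) for L in lengths]
    (l0, t0), (l1, t1) = data[0], data[-1]
    k = math.log(t1 / t0) / math.log(l1 / l0)  # slope in log-log space
    c = t0 / l0 ** k
    return f"T = {c:.3f} * L^{k:.2f}"          # hypothesis to be scored

print(run_agent(PendulumEnv()))  # -> T = 2.006 * L^0.50
```

In the benchmark itself the hypothesis would come from an LLM agent rather than an analytic fit; the sketch only shows the probe-observe-hypothesize loop and the query budget that the abstract describes.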
Related papers
- GRACE: an Agentic AI for Particle Physics Experiment Design and Simulation [0.0]
GRACE is a simulation-native agent for autonomous experimental design in high-energy and nuclear physics. It autonomously explores design modifications using first-principles Monte Carlo methods. It evaluates candidate designs through repeated simulation, physics-motivated utility functions, and budget-aware escalation.
arXiv Detail & Related papers (2026-01-31T01:12:55Z)
- Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration [63.61423859450929]
This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses. We identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery.
arXiv Detail & Related papers (2026-01-20T18:46:42Z)
- SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z)
- HeurekaBench: A Benchmarking Framework for AI Co-scientist [2.206319727896241]
HeurekaBench is a framework for creating benchmarks with exploratory, open-ended research questions for experimental datasets. We instantiate the framework in single-cell biology to obtain the sc-HeurekaBench benchmark and use it to compare state-of-the-art single-cell agents. We find that the addition of a critic module can improve ill-formed responses from open-source LLM-based agents by up to 22% and close the gap with their closed-source counterparts.
arXiv Detail & Related papers (2026-01-04T22:16:42Z)
- AInsteinBench: Benchmarking Coding Agents on Scientific Repositories [33.48206557020983]
AInsteinBench is a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents. AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
arXiv Detail & Related papers (2025-12-24T08:11:11Z)
- From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z)
- PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation [7.0748516420242495]
PRiSM is a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent. We propose five targeted evaluation tasks covering perturbation, symbolic program synthesis, robustness, reasoning correction, and ambiguity resolution.
arXiv Detail & Related papers (2025-12-05T18:14:55Z)
- SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
- Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and tools. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z)
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [136.27567671480156]
We introduce experiment-guided ranking, which prioritizes hypotheses based on feedback from prior tests. We frame experiment-guided ranking as a sequential decision-making problem. Our approach significantly outperforms pre-experiment baselines and strong ablations.
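The summary's framing of experiment-guided ranking as sequential decision-making suggests a simple loop: run an experiment, observe feedback, and re-score the remaining hypotheses. The toy sketch below (hypothetical feature vectors and a reward-weighted similarity score, not MOOSE-Chem3's actual method) illustrates that re-ranking step.

```python
def rerank(hypotheses, feedback):
    """Reorder hypotheses by similarity to outcomes of experiments run so far.

    hypotheses: dict of name -> feature vector
    feedback:   list of (feature vector, observed reward) from prior tests
    """
    def score(feat):
        # reward-weighted dot-product similarity to tested hypotheses
        return sum(r * sum(a * b for a, b in zip(feat, f)) for f, r in feedback)
    return sorted(hypotheses, key=lambda h: score(hypotheses[h]), reverse=True)

hyps = {"h1": [1.0, 0.0], "h2": [0.0, 1.0], "h3": [0.7, 0.7]}
trials = [([1.0, 0.1], 1.0), ([0.0, 1.0], -0.5)]  # simulated experimental feedback
print(rerank(hyps, trials))  # -> ['h1', 'h3', 'h2']
```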
arXiv Detail & Related papers (2025-05-23T13:24:50Z)
- APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight [3.5385022178794805]
APEX (Anticipatory Physics-Enhanced Execution) is a framework that equips Large Language Models with physics-driven foresight for real-time task planning. APEX significantly outperforms standard LLMs and VLM-based models.
arXiv Detail & Related papers (2025-05-20T04:34:58Z)
- Benchmarking LLMs' Swarm intelligence [50.544186914115045]
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) remains largely unexplored. We introduce SwarmBench, a novel benchmark designed to systematically evaluate LLMs acting as decentralized agents on coordination tasks. We propose metrics for coordination effectiveness and analyze emergent group dynamics.
arXiv Detail & Related papers (2025-05-07T12:32:01Z)
- Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning [0.15469999759898032]
PLAID is a framework for representing and sharing datasets of physics simulations. PLAID defines a unified standard for describing simulation data. We release six datasets under the PLAID standard, covering structural mechanics and computational fluid dynamics.
arXiv Detail & Related papers (2025-05-05T18:59:17Z)
- Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs [23.608962459019278]
We introduce a novel benchmark to evaluate Large Language Models (LLMs) for scientific discovery in both the natural and social sciences. Our benchmark is based on the principles of causal graph discovery: it challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as problem complexity increases.
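Auto-Bench's causal-graph-discovery framing can be illustrated with a small intervention loop: the agent forces one variable at a time, records which others respond, and keeps only edges that no intermediate variable already explains. The graph, do-operator, and edge-recovery rule below are a simplified toy (it assumes no variable is reached both directly and indirectly), not the benchmark's actual protocol.

```python
# Hidden ground-truth causal graph over three variables: A -> B -> C.
EDGES = {("A", "B"), ("B", "C")}
VARS = ["A", "B", "C"]

def intervene(var):
    """do(var): return the set of variables that respond to forcing `var`."""
    affected, frontier = set(), [var]
    while frontier:
        node = frontier.pop()
        for u, v in EDGES:
            if u == node and v not in affected:
                affected.add(v)
                frontier.append(v)
    return affected

def discover():
    """Keep edge (u, v) only if no intermediate w already explains v."""
    guessed = set()
    for u in VARS:
        downstream = intervene(u)
        for v in downstream:
            explained = any(w != v and v in intervene(w) for w in downstream)
            if not explained:
                guessed.add((u, v))
    return guessed

print(discover() == EDGES)  # -> True
```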
arXiv Detail & Related papers (2025-02-21T05:35:20Z)
- MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLMs) have shown strong capabilities in general visual reasoning tasks. We develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS), based on an MLLM. MAPS decomposes an expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z)
- Using Machine Learning to Discover Parsimonious and Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff Dynamics [1.1510009152620668]
An underexplored aspect of machine learning is how to develop minimally-optimal representations that can facilitate better insight regarding system functioning. Our own view is that ML-based modeling should be based on computational units that are fundamentally easy to interpret in a physical-conceptual sense. We show, in the context of lumped modeling, that physical interpretability and predictive performance can both be achieved using a relatively parsimonious distributed-state multiple-flow-path network.
arXiv Detail & Related papers (2024-12-06T08:30:01Z)
- LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models [35.01842161084472]
We propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact. We present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance.
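The LLMPhy summary describes a propose-simulate-feedback loop: the LLM synthesizes candidate physics parameters (or programs), a physics engine scores them, and the error is fed back into the next proposal. A minimal sketch of that loop, with a random-search stand-in for the LLM and a quadratic-error stand-in for the simulator (both hypothetical; LLMPhy uses an actual LLM and engine):

```python
import random

def simulate(friction, true_friction=0.42):
    """Stand-in physics engine: trajectory error for a guessed parameter."""
    return (friction - true_friction) ** 2

def propose(history):
    """Stand-in for the LLM: suggest a parameter given past (guess, error) pairs."""
    if not history:
        return random.uniform(0.0, 1.0)
    best, _ = min(history, key=lambda h: h[1])
    return min(1.0, max(0.0, best + random.gauss(0.0, 0.1)))

def optimization_loop(iterations=30):
    """Zero-shot black-box optimization: propose -> simulate -> feed back."""
    history = []
    for _ in range(iterations):
        guess = propose(history)
        history.append((guess, simulate(guess)))
    return min(history, key=lambda h: h[1])

random.seed(0)
print(optimization_loop())  # best (guess, error) pair, converging near 0.42
```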
arXiv Detail & Related papers (2024-11-12T18:56:58Z)
- LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
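The bilevel structure named in the title can be written down explicitly. In our hedged reading of the abstract (this is not necessarily the paper's notation), the outer level is the LLM's discrete search over hypotheses h, such as candidate law structures or molecule designs, and the inner level is the simulation's continuous fit of parameters θ:

```latex
\min_{h \in \mathcal{H}} \; L_{\mathrm{outer}}\bigl(h, \theta^{*}(h)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(h) = \operatorname*{arg\,min}_{\theta} \; L_{\mathrm{sim}}(h, \theta)
```

The LLM proposes h using the inner level's fitted losses as feedback, which is how the abstract's "knowledge-driven, abstract reasoning" and "computational strength of simulations" are coupled.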
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
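The "entropy-augmented RL method tailored for optimizing LLMs at the token level" corresponds to treating each generated token as an action in a soft-RL objective. A generic form of such an objective, which is our reconstruction from the summary rather than ETPO's exact formulation, with the vocabulary as the action space and α weighting the entropy bonus:

```latex
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \Bigl( r(s_t, a_t) + \alpha \, \mathcal{H}\bigl(\pi_\theta(\cdot \mid s_t)\bigr) \Bigr)\right],
\qquad a_t \in \mathcal{V} \ \ \text{(one token per step)}
```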
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- Physics Inspired Hybrid Attention for SAR Target Recognition [61.01086031364307]
We propose a physics-inspired hybrid attention (PIHA) mechanism and the once-for-all (OFA) evaluation protocol to address these issues.
PIHA leverages the high-level semantics of physical information to activate and guide feature groups that are aware of the target's local semantics.
Our method outperforms other state-of-the-art approaches in 12 test scenarios with the same ASC parameters.
arXiv Detail & Related papers (2023-09-27T14:39:41Z)
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.