PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
- URL: http://arxiv.org/abs/2503.21821v1
- Date: Wed, 26 Mar 2025 06:21:56 GMT
- Title: PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
- Authors: Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan
- Abstract summary: We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics.
- Score: 38.44445350202585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.
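The abstract credits a robust automated evaluation system but does not describe how answers are validated. As a purely illustrative sketch, one plausible approach is symbolic equivalence checking with SymPy; the function below, its name, and its normalization behavior are assumptions, not the paper's actual system.

```python
# Hypothetical sketch of an automated grader for symbolic physics answers.
# Assumes each answer reduces to a single SymPy-parsable expression; the
# real PHYSICS evaluation system may work quite differently.
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def answers_match(model_answer: str, reference: str, tol: float = 1e-6) -> bool:
    """Return True if the two expressions are symbolically or numerically equal."""
    try:
        pred = parse_expr(model_answer)
        gold = parse_expr(reference)
    except Exception:
        return False  # unparsable model output counts as incorrect
    diff = simplify(pred - gold)
    if diff == 0:
        return True
    # Numeric fallback for differences simplify() cannot close symbolically.
    try:
        return abs(float(diff.evalf())) < tol
    except (TypeError, ValueError):
        return False  # free symbols remain; not provably equal

# Example: two equivalent forms of the kinetic-energy expression.
assert answers_match("m*v**2/2", "0.5*m*v**2")
```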
Related papers
- PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems [3.0901186959880977]
We evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We introduce a new evaluation benchmark for physics problems, PHYSICSEVAL, consisting of 19,609 problems sourced from various physics textbooks.
arXiv Detail & Related papers (2025-07-31T18:12:51Z)
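The PhysicsEval abstract mentions inference-time techniques without detailing them. One standard example of such a technique is self-consistency: sample several solutions and keep the majority final answer. The sketch below is illustrative only; the `generate` and `extract_answer` callables are assumptions, not the paper's method.

```python
# Illustrative sketch of self-consistency, one common inference-time
# technique; not necessarily among the methods evaluated in PhysicsEval.
from collections import Counter
from typing import Callable

def self_consistent_answer(generate: Callable[[str], str],
                           extract_answer: Callable[[str], str],
                           problem: str,
                           n_samples: int = 8) -> str:
    """Sample n_samples solutions and return the most frequent final answer."""
    answers = []
    for _ in range(n_samples):
        solution = generate(problem)          # one stochastic LLM sample
        answers.append(extract_answer(solution))
    return Counter(answers).most_common(1)[0][0]
```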
- PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
arXiv Detail & Related papers (2025-06-21T09:55:42Z)
- Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and tools. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z)
- SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning [89.48883747910448]
We present SeePhys, a large-scale multimodal benchmark for reasoning grounded in physics questions. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. We observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark.
arXiv Detail & Related papers (2025-05-25T11:28:34Z)
- Scaling Physical Reasoning with the PHYSICS Dataset [32.956687630330116]
PHYSICS is a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses.
arXiv Detail & Related papers (2025-05-21T17:06:28Z)
- PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [36.193595420239845]
We present PhysReason, a 1,200-problem benchmark for evaluating large language models. Problems require an average of 8.1 solution steps, with the hard subset requiring 15.6. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation.
arXiv Detail & Related papers (2025-02-17T17:24:14Z)
- UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models [39.917074900737575]
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks. The domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics.
arXiv Detail & Related papers (2025-02-01T06:42:02Z)
- Physics Reasoner: Knowledge-Augmented Reasoning for Solving Physics Problems with Large Language Models [41.88825441287559]
Existing large language models (LLMs) frequently fail due to a lack of knowledge or incorrect knowledge application. We propose Physics Reasoner, a knowledge-augmented framework to solve physics problems with LLMs. Given a physics problem, Physics Reasoner solves it through three stages: problem analysis, formula retrieval, and guided reasoning. Empirically, Physics Reasoner mitigates the issues of insufficient knowledge and incorrect application, achieving state-of-the-art performance on SciBench with an average accuracy improvement of 5.8%.
arXiv Detail & Related papers (2024-12-18T12:33:50Z)
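The Physics Reasoner abstract names three stages but not their implementation. A minimal sketch of how such a pipeline could be wired together follows; the prompts, the `llm` callable, and the `retrieve_formulas` interface are illustrative assumptions, not the paper's code.

```python
# Hypothetical three-stage pipeline following the stages named in the
# Physics Reasoner abstract; all prompts and interfaces are assumptions.
from typing import Callable, List

def solve_physics_problem(problem: str,
                          llm: Callable[[str], str],
                          retrieve_formulas: Callable[[str], List[str]]) -> str:
    # Stage 1: problem analysis -- identify knowns, unknowns, and the subfield.
    analysis = llm("Analyze this physics problem. List the given quantities, "
                   "the target quantity, and the relevant subfield.\n" + problem)
    # Stage 2: formula retrieval -- fetch candidate formulas for this analysis.
    formulas = retrieve_formulas(analysis)
    # Stage 3: guided reasoning -- solve step by step with the retrieved formulas.
    return llm("Solve the problem step by step, using only these formulas:\n"
               + "\n".join(formulas)
               + "\n\nAnalysis:\n" + analysis
               + "\n\nProblem:\n" + problem)
```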
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for evaluating college-level scientific problem solving with Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)