PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
- URL: http://arxiv.org/abs/2502.12054v1
- Date: Mon, 17 Feb 2025 17:24:14 GMT
- Title: PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
- Authors: Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
- Abstract summary: We present PhysReason, a 1,200-problem benchmark for evaluating large language models.
Problems require an average of 8.1 solution steps, with hard problems requiring 15.6.
Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation.
- Score: 36.193595420239845
- License:
- Abstract: Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logical reasoning. However, current evaluations overlook physics-based reasoning - a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.
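The abstract names two evaluation granularities without detailing their mechanics. The following minimal Python sketch illustrates how an answer-level check and a per-step, per-category error tally could be organized; the `Step` fields, the 1% relative tolerance, and the way steps are tagged with the four bottleneck categories are illustrative assumptions, not the paper's actual Physics Solution Auto Scoring Framework.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names, tolerance, and step labeling are
# assumptions, not the paper's Physics Solution Auto Scoring Framework.

BOTTLENECKS = (
    "Physics Theorem Application",
    "Physics Process Understanding",
    "Calculation",
    "Physics Condition Analysis",
)

@dataclass
class Step:
    category: str      # one of BOTTLENECKS (assumed labeling scheme)
    predicted: float   # model's intermediate numeric result
    reference: float   # gold intermediate result

def answer_level_score(predicted: float, reference: float, rtol: float = 0.01) -> bool:
    """Efficient check: is the final answer within a relative tolerance?"""
    return abs(predicted - reference) <= rtol * abs(reference)

def step_level_report(steps: list[Step], rtol: float = 0.01) -> dict[str, int]:
    """Comprehensive check: count errors per category to expose bottlenecks."""
    errors: dict[str, int] = {name: 0 for name in BOTTLENECKS}
    for step in steps:
        if not answer_level_score(step.predicted, step.reference, rtol):
            errors[step.category] += 1
    return errors

if __name__ == "__main__":
    steps = [
        Step("Physics Theorem Application", 9.8, 9.8),
        Step("Calculation", 4.7, 4.9),  # arithmetic slip in an intermediate step
    ]
    print(answer_level_score(4.7, 4.9))  # False: final answer misses tolerance
    print(step_level_report(steps))      # only the "Calculation" counter increments
```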
Related papers
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models [39.917074900737575]
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks.
The domain of physics reasoning presents unique challenges that have received significantly less attention.
Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics.
arXiv Detail & Related papers (2025-02-01T06:42:02Z)
- Physics Reasoner: Knowledge-Augmented Reasoning for Solving Physics Problems with Large Language Models [41.88825441287559]
Existing large language models (LLMs) frequently fail due to a lack of knowledge or incorrect knowledge application.
We propose Physics Reasoner, a knowledge-augmented framework to solve physics problems with LLMs.
Given a physics problem, Physics Reasoner solves it through three stages: problem analysis, formula retrieval, and guided reasoning (a sketch of this kind of pipeline follows after this list).
Empirically, Physics Reasoner mitigates the issues of insufficient knowledge and incorrect application, achieving state-of-the-art performance on SciBench with an average accuracy improvement of 5.8%.
arXiv Detail & Related papers (2024-12-18T12:33:50Z)
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level.
Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation.
Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving only 60.54% and 52.55% accuracy respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our in-depth analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
- Physics simulation capabilities of LLMs [0.0]
Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding.
We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems.
arXiv Detail & Related papers (2023-12-04T18:06:41Z)
- Using Large Language Model to Solve and Explain Physics Word Problems Approaching Human Level [0.0]
Large language models (LLMs) pre-trained on text can solve not only pure math word problems but also physics word problems.
Our work is the first research to focus on the automatic solving, explanation, and generation of physics word problems.
arXiv Detail & Related papers (2023-09-15T06:13:06Z)
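The Physics Reasoner entry above describes a three-stage, knowledge-augmented pipeline (problem analysis, formula retrieval, guided reasoning) without implementation detail. The sketch below shows what such a loop could look like; the `call_llm` stub, the keyword-based retrieval rule, and the small formula table are hypothetical placeholders, not the authors' system.

```python
# Hypothetical three-stage knowledge-augmented pipeline in the spirit of
# Physics Reasoner; the formula table, retrieval rule, and LLM stub are
# placeholders, not the authors' implementation.

FORMULA_DB = {
    "kinematics": "v = v0 + a*t;  x = x0 + v0*t + 0.5*a*t**2",
    "energy": "E_k = 0.5*m*v**2;  E_p = m*g*h",
    "circuits": "V = I*R;  P = V*I",
}

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[model response to: {prompt[:60]}...]"

def analyze_problem(problem: str) -> str:
    """Stage 1: extract knowns, unknowns, and the physical process."""
    return call_llm(f"List the knowns, unknowns, and physical process in:\n{problem}")

def retrieve_formulas(problem: str) -> list[str]:
    """Stage 2: naive keyword retrieval from a small formula knowledge base."""
    text = problem.lower()
    hits = [f for topic, f in FORMULA_DB.items() if topic in text]
    return hits or list(FORMULA_DB.values())

def guided_reasoning(problem: str, analysis: str, formulas: list[str]) -> str:
    """Stage 3: ask the model to solve step by step with the retrieved formulas."""
    prompt = (
        f"Problem:\n{problem}\n\nAnalysis:\n{analysis}\n\n"
        "Relevant formulas:\n" + "\n".join(formulas) +
        "\n\nSolve step by step and check each formula's conditions."
    )
    return call_llm(prompt)

if __name__ == "__main__":
    problem = "A cart accelerates from rest at 2 m/s^2 for 3 s (kinematics). Find its speed."
    analysis = analyze_problem(problem)
    print(guided_reasoning(problem, analysis, retrieve_formulas(problem)))
```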
This list is automatically generated from the titles and abstracts of the papers on this site.