PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
- URL: http://arxiv.org/abs/2502.12054v1
- Date: Mon, 17 Feb 2025 17:24:14 GMT
- Title: PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
- Authors: Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
- Abstract summary: We present PhysReason, a 1,200-problem benchmark for evaluating large language models. Problems require an average of 8.1 solution steps, with hard problems requiring 15.6. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation.
- Score: 36.193595420239845
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logical reasoning. However, current evaluations overlook physics-based reasoning, a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.
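The abstract names answer-level and step-level evaluation but gives no implementation detail here; the Python sketch below only illustrates how the two granularities could be combined. The `Solution` container, the string-normalized answer comparison, and the substring-based step matching are assumptions of this sketch, not the authors' Physics Solution Auto Scoring Framework.

```python
# Minimal sketch of answer-level vs. step-level scoring (hypothetical; not the
# authors' Physics Solution Auto Scoring Framework implementation).
from dataclasses import dataclass


@dataclass
class Solution:
    steps: list[str]      # intermediate derivation steps
    final_answer: str     # final numeric/symbolic answer


def answer_level_score(pred: Solution, gold: Solution) -> float:
    """1.0 if the final answers match after whitespace/case normalization, else 0.0."""
    normalize = lambda s: s.strip().lower().replace(" ", "")
    return float(normalize(pred.final_answer) == normalize(gold.final_answer))


def step_level_score(pred: Solution, gold: Solution) -> float:
    """Fraction of reference steps matched in order (a crude stand-in for a
    rubric- or judge-based step comparison)."""
    matched, j = 0, 0
    for gold_step in gold.steps:
        for k in range(j, len(pred.steps)):
            if gold_step.lower() in pred.steps[k].lower():
                matched += 1
                j = k + 1
                break
    return matched / max(len(gold.steps), 1)


if __name__ == "__main__":
    gold = Solution(steps=["apply energy conservation", "solve for v"],
                    final_answer="v = 4.4 m/s")
    pred = Solution(steps=["Apply energy conservation: mgh = 1/2 m v^2",
                           "Solve for v: v = sqrt(2gh) = 4.4 m/s"],
                    final_answer="v = 4.4 m/s")
    print(answer_level_score(pred, gold))   # 1.0
    print(step_level_score(pred, gold))     # 1.0
```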
Related papers
- PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models [30.597050689757605]
PHYBench is a benchmark for evaluating reasoning capabilities of large language models (LLMs) in physical contexts.
It consists of 500 physics problems based on real-world physical scenarios, covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics.
We also propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions.
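The EED Score is described here only as an edit distance between mathematical expressions; the following minimal sketch assumes a naive regex tokenizer and a Levenshtein distance normalized by the reference length. These choices are illustrative assumptions, not PHYBench's published definition.

```python
# Hedged sketch of an expression edit-distance score; the tokenizer and the
# normalization are assumptions, not the published EED formula.
import re


def tokenize(expr: str) -> list[str]:
    """Split an expression into identifiers, numbers, and single-character operators."""
    return re.findall(r"[A-Za-z_]\w*|\d+\.?\d*|\S", expr.replace(" ", ""))


def levenshtein(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming edit distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ta != tb)))    # substitution
        prev = cur
    return prev[-1]


def eed_like_score(pred: str, ref: str) -> float:
    """1.0 for an exact token match, decaying toward 0 as edits accumulate."""
    p, r = tokenize(pred), tokenize(ref)
    return max(0.0, 1.0 - levenshtein(p, r) / max(len(r), 1))


print(eed_like_score("v = sqrt(2*g*h)", "v = sqrt(2*g*h)"))  # 1.0
print(eed_like_score("v = sqrt(g*h)", "v = sqrt(2*g*h)"))    # < 1.0
```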
arXiv Detail & Related papers (2025-04-22T17:53:29Z)
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.
OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [13.530403536762064]
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology.
The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level.
We evaluate various open and closed language models on our dataset, including o3-mini, o1, DeepSeek-R1, GPT-4o, and versions of Llama and Qwen.
arXiv Detail & Related papers (2025-02-19T19:00:00Z)
- UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models [39.917074900737575]
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks. The domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics.
arXiv Detail & Related papers (2025-02-01T06:42:02Z)
- Physics Reasoner: Knowledge-Augmented Reasoning for Solving Physics Problems with Large Language Models [41.88825441287559]
Existing large language models (LLMs) frequently fail due to a lack of knowledge or incorrect knowledge application. We propose Physics Reasoner, a knowledge-augmented framework to solve physics problems with LLMs. Given a physics problem, Physics Reasoner solves it through three stages: problem analysis, formula retrieval, and guided reasoning. Empirically, Physics Reasoner mitigates the issues of insufficient knowledge and incorrect application, achieving state-of-the-art performance on SciBench with an average accuracy improvement of 5.8%.
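Since the summary names the three stages but no interface, here is an illustrative sketch in that spirit; the `llm` callable, the prompts, and the toy formula table are assumed for the example and are not the paper's implementation.

```python
# Illustrative three-stage pipeline in the spirit of Physics Reasoner
# (problem analysis -> formula retrieval -> guided reasoning). The `llm`
# callable, prompts, and formula table are assumptions, not the paper's code.
from typing import Callable

FORMULA_TABLE = {
    "kinematics": ["v = v0 + a*t", "x = v0*t + 0.5*a*t**2"],
    "energy":     ["E_k = 0.5*m*v**2", "E_p = m*g*h"],
}


def solve(problem: str, llm: Callable[[str], str]) -> str:
    # Stage 1: problem analysis -- identify the relevant physics topic.
    topic = llm(f"Name the single most relevant topic "
                f"({', '.join(FORMULA_TABLE)}) for this problem:\n{problem}").strip()

    # Stage 2: formula retrieval -- look up candidate formulas for that topic.
    formulas = FORMULA_TABLE.get(topic, [])

    # Stage 3: guided reasoning -- solve step by step with the retrieved formulas.
    return llm(f"Solve step by step, citing which formula you apply.\n"
               f"Formulas: {formulas}\nProblem: {problem}")


# Usage with any text-completion function, e.g. a stubbed model:
if __name__ == "__main__":
    fake_llm = lambda prompt: "energy" if "topic" in prompt else "E_p = m*g*h = ..."
    print(solve("A 2 kg ball is lifted 3 m. Find its potential energy.", fake_llm))
```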
arXiv Detail & Related papers (2024-12-18T12:33:50Z)
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving accuracies of 60.54% and 52.55%, respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
- Physics simulation capabilities of LLMs [0.0]
Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding.
We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems.
arXiv Detail & Related papers (2023-12-04T18:06:41Z)
- Using Large Language Model to Solve and Explain Physics Word Problems Approaching Human Level [0.0]
Large language models (LLMs) pre-trained on text can solve not only pure math word problems but also physics word problems.
Our work is the first research to focus on the automatic solving, explanation, and generation of physics word problems.
arXiv Detail & Related papers (2023-09-15T06:13:06Z)