Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs
- URL: http://arxiv.org/abs/2507.01334v2
- Date: Thu, 03 Jul 2025 13:15:11 GMT
- Title: Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs
- Authors: Nifu Dan, Yujun Cai, Yiwei Wang
- Abstract summary: This study investigates the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation.
- Score: 12.215295420714787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.
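The abstract highlights two ingredients: few-shot prompting and a symbolic-first solution style (derive an expression algebraically, substitute numbers last). The snippet below is a minimal illustrative sketch of that setup, not the authors' actual pipeline: it assumes an OpenAI-compatible chat API, and the model name, system prompt, and worked exemplar are hypothetical placeholders.

```python
# Minimal few-shot prompting sketch (illustrative; not the paper's exact setup).
# Assumes an OpenAI-compatible chat endpoint; model name and exemplar are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (and optionally a custom base_url) from the environment

# One worked exemplar demonstrating the symbolic-first style:
# derive the answer algebraically, then substitute numerical values at the end.
FEW_SHOT_EXEMPLAR = """Problem: A ball is dropped from rest from height h = 20 m. Find the impact speed.
Solution: From energy conservation, m g h = (1/2) m v^2, so v = sqrt(2 g h).
Substituting g = 9.8 m/s^2 and h = 20 m gives v = sqrt(392) ≈ 19.8 m/s.
Answer: 19.8 m/s"""

def solve_physics_problem(problem: str, model: str = "deepseek-reasoner") -> str:
    """Ask a reasoning model to solve a physics problem, guided by one exemplar."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("Solve physics problems. Derive a symbolic expression first, "
                         "then substitute numerical values. End with 'Answer: <value> <unit>'.")},
            {"role": "user", "content": FEW_SHOT_EXEMPLAR + "\n\nProblem: " + problem},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(solve_physics_problem(
        "A 2 kg mass on a frictionless incline of 30 degrees is released from rest. "
        "What is its acceleration along the incline?"
    ))
```

In practice one would vary the number of exemplars and compare symbolic-first versus purely numerical prompting styles; this sketch only fixes the scaffolding for such an experiment.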
Related papers
- ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems [21.278539804482012]
Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings.
arXiv Detail & Related papers (2025-07-07T08:43:56Z) - PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
arXiv Detail & Related papers (2025-06-21T09:55:42Z) - Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolboxes. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z) - PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models [9.097623284579836]
Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. We introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles.
arXiv Detail & Related papers (2025-05-30T17:25:20Z) - SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning [89.48883747910448]
We present SeePhys, a large-scale multimodal benchmark for reasoning grounded in physics questions. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. We observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark.
arXiv Detail & Related papers (2025-05-25T11:28:34Z) - Is the end of Insight in Sight ? [0.0]
A physics-informed neural network (PINN) is trained on a rarefied gas dynamics problem governed by the Boltzmann equation. Despite the system's clear structure and well-understood governing laws, the trained network's weights resemble Gaussian-distributed random matrices. This suggests that deep learning and traditional simulation may follow distinct cognitive paths to the same outcome.
arXiv Detail & Related papers (2025-05-07T19:57:36Z) - Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. We rigorously analyze both final answers and solution steps to identify reasoning failures. We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - From LSAT: The Progress and Challenges of Complex Reasoning [56.07448735248901]
We study three challenging and domain-general tasks from the Law School Admission Test (LSAT): analytical reasoning, logical reasoning, and reading comprehension.
We propose a hybrid reasoning system that integrates these three tasks and achieves impressive overall performance on the LSAT tests.
arXiv Detail & Related papers (2021-08-02T05:43:03Z)