Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
- URL: http://arxiv.org/abs/2502.15815v1
- Date: Wed, 19 Feb 2025 19:00:00 GMT
- Title: Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
- Authors: Daniel J. H. Chung, Zhiqi Gao, Yurii Kvasiuk, Tianyi Li, Moritz Münchmeyer, Maja Rudolph, Frederic Sala, Sai Chaitanya Tadepalli
- Abstract summary: We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. We evaluate our dataset on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o, and versions of Llama and Qwen.
- Score: 13.530403536762064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our dataset on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o, and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While current state-of-the-art models are still of limited use for researchers, our results show that AI-assisted theoretical physics research may become possible in the near future. We discuss the main obstacles towards this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the dataset and score distribution are available on the dataset website, tpbench.org.
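The auto-verifiability challenge the abstract mentions can be made concrete: grading a symbolic final answer requires checking mathematical equivalence, not string equality. Below is a minimal sketch of such a check, assuming both answers parse in sympy; it illustrates the problem rather than TPBench's actual grading pipeline, and `answers_match` is a hypothetical helper.

```python
# Minimal sketch of symbolic answer verification (illustrative only,
# not TPBench's grading code). Assumes both answers parse in sympy.
import sympy as sp

def answers_match(model_answer: str, reference: str) -> bool:
    """Treat two answers as equal if their difference simplifies to 0.

    Practical but incomplete: simplify() can miss equivalences, so a
    False result does not prove the model's answer is wrong.
    """
    diff = sp.simplify(sp.sympify(model_answer) - sp.sympify(reference))
    return diff == 0

print(answers_match("sin(x)**2 + cos(x)**2", "1"))    # True
print(answers_match("(x + 1)**2", "x**2 + 2*x + 1"))  # True
print(answers_match("x + 1", "x - 1"))                # False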
Related papers
- PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems [3.0901186959880977]
We evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks.
arXiv Detail & Related papers (2025-07-31T18:12:51Z)
- Towards a Large Physics Benchmark [1.882115594816394]
We introduce a benchmark framework to evaluate, monitor, and steer large language model development in fundamental physics. We develop a scoring system in which each question is scored by an expert for its correctness, difficulty, and surprise. Our current dataset contains a diverse set of examples, including a machine learning challenge to classify high-energy physics events.
arXiv Detail & Related papers (2025-07-29T11:19:00Z)
- FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming [19.576944188747166]
FormulaOne is a benchmark for graph theory, logic, and algorithms. Our problems are highly demanding, each requiring a long chain of reasoning steps. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne.
arXiv Detail & Related papers (2025-07-17T17:53:55Z)
- PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
arXiv Detail & Related papers (2025-06-21T09:55:42Z)
- Scaling Physical Reasoning with the PHYSICS Dataset [32.956687630330116]
PHYSICS is a dataset containing 16,568 high-quality physics problems spanning multiple subjects and difficulty levels. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses.
arXiv Detail & Related papers (2025-05-21T17:06:28Z)
- PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving [38.44445350202585]
We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1,297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics.
arXiv Detail & Related papers (2025-03-26T06:21:56Z)
- PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems.
The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction.
Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
- PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [36.193595420239845]
We present PhysReason, a 1,200-problem benchmark for evaluating large language models. Problems require an average of 8.1 solution steps, with hard problems requiring 15.6. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation.
arXiv Detail & Related papers (2025-02-17T17:24:14Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization, where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios (a generic A* sketch follows this entry). Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z)
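For readers unfamiliar with the search method the ExploreToM entry mentions, here is a generic A* sketch. It shows only the textbook algorithm; ExploreToM's domain-specific language, state representation, and scoring are not reproduced here, and all names are illustrative.

```python
# Generic A* best-first search (textbook version, illustrative names).
# ExploreToM applies this idea to partial story structures in its DSL;
# here, states are opaque hashable values supplied by the caller.
import heapq
import itertools
from typing import Callable, Hashable, Iterable, Optional

def a_star(start: Hashable,
           is_goal: Callable[[Hashable], bool],
           neighbors: Callable[[Hashable], Iterable[tuple[Hashable, float]]],
           heuristic: Callable[[Hashable], float]) -> Optional[list]:
    """Return a lowest-cost path from start to a goal state, or None."""
    tie = itertools.count()  # tiebreaker so the heap never compares states
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt, step_cost in neighbors(state):
            g_next = g + step_cost
            if g_next < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = g_next
                f_next = g_next + heuristic(nxt)
                heapq.heappush(frontier,
                               (f_next, next(tie), g_next, nxt, path + [nxt]))
    return None

# Toy usage: walk from 0 to 5 on the integer line with unit step costs.
print(a_star(0, lambda s: s == 5,
             lambda s: [(s + 1, 1.0), (s - 1, 1.0)],
             lambda s: abs(5 - s)))  # [0, 1, 2, 3, 4, 5]
```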
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z)
- Physics simulation capabilities of LLMs [0.0]
Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding.
We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems.
arXiv Detail & Related papers (2023-12-04T18:06:41Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first- and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning: physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI [32.54495599722743]
We propose a new reasoning challenge, namely Fermi Problems (FPs).
FPs are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible (a worked example follows this entry).
We present two datasets: 1) A collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge.
arXiv Detail & Related papers (2021-10-27T06:39:33Z)
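To make the flavor of an FP concrete, here is a back-of-the-envelope estimate of the title question. Every input below is an assumed round number chosen for illustration, not data from the paper or from EMNLP itself; only the order of magnitude is meant to be credible.

```python
# Fermi estimate of coffee consumed at a conference (all inputs assumed).
attendees = 2000              # assumed attendee count
days = 4                      # assumed conference length in days
cups_per_person_per_day = 2   # assumed average coffee habit
litres_per_cup = 0.25         # assume 250 ml per cup

cups = attendees * days * cups_per_person_per_day
litres = cups * litres_per_cup
print(f"~{cups} cups, roughly {litres:.0f} litres")
# ~16000 cups, roughly 4000 litres
```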
- Living in the Physics and Machine Learning Interplay for Earth Observation [7.669855697331746]
Inference means understanding the relations between variables and deriving physically interpretable models.
Machine learning models alone are excellent approximators, but very often do not respect the most elementary laws of physics.
This is a collective long-term AI agenda towards developing and applying algorithms capable of discovering knowledge in the Earth system.
arXiv Detail & Related papers (2020-10-18T16:58:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.