Towards a Large Physics Benchmark
- URL: http://arxiv.org/abs/2507.21695v1
- Date: Tue, 29 Jul 2025 11:19:00 GMT
- Title: Towards a Large Physics Benchmark
- Authors: Kristian G. Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W. de Regt, Merijn Moody,
- Abstract summary: We introduce a benchmark framework to evaluate, monitor and steer large language model development in fundamental physics. We develop a scoring system in which each question is scored by an expert for its correctness, difficulty, and surprise. Our current dataset contains a diverse set of examples, including a machine learning challenge to classify high-energy physics events.
- Score: 1.882115594816394
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce a benchmark framework developed by and for the scientific community to evaluate, monitor and steer large language model development in fundamental physics. Building on philosophical concepts of scientific understanding and creativity, we develop a scoring system in which each question is scored by an expert for its correctness, difficulty, and surprise. The questions are of three forms: (i) multiple-choice questions for conceptual understanding, (ii) analytical problems requiring mathematical derivation, and (iii) open-ended tasks requiring complex problem solving. Our current dataset contains a diverse set of examples, including a machine learning challenge to classify high-energy physics events, such as the four top quark signal. To ensure continued relevance, we propose a living benchmark, where physicists contribute questions, for instance alongside new publications. We invite contributions via: http://www.physicsbenchmarks.org/. We hope that this benchmark will enable a targeted AI development that can make a meaningful contribution to fundamental physics research.
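To make the proposed scoring system concrete, the sketch below shows what a single benchmark record and a combined score could look like in Python. The abstract defines the three question forms and the three expert-scored axes (correctness, difficulty, surprise); all field names, score ranges, and the weighted aggregate below are illustrative assumptions, not the schema used by the authors.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    # The three question forms named in the abstract.
    MULTIPLE_CHOICE = "multiple_choice"  # conceptual understanding
    ANALYTICAL = "analytical"            # mathematical derivation
    OPEN_ENDED = "open_ended"            # complex problem solving


@dataclass
class BenchmarkQuestion:
    """One entry in the living benchmark (hypothetical schema)."""
    question_id: str
    question_type: QuestionType
    statement: str
    reference_answer: str
    # Expert-assigned scores on the three axes from the abstract,
    # assumed here to be normalized to [0, 1].
    correctness: float
    difficulty: float
    surprise: float


def aggregate_score(q: BenchmarkQuestion,
                    weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted mean of the three expert scores (purely illustrative)."""
    wc, wd, ws = weights
    return (wc * q.correctness + wd * q.difficulty + ws * q.surprise) / (wc + wd + ws)
```

For example, `aggregate_score(q, weights=(2.0, 1.0, 1.0))` would weight correctness twice as heavily as the other two axes; how the benchmark actually combines the scores is specified in the paper, not here.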
Related papers
- Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs [12.215295420714787]
This study investigates the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation.
arXiv Detail & Related papers (2025-07-02T03:51:16Z) - PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
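As background on how such automated filtering might work, the sketch below grades each question by the fraction of model roll-outs that solve it and drops questions every roll-out gets right. The cutoff values, answer-matching logic, and function names are assumptions for illustration, not PhysUniBench's actual pipeline.

```python
def solve_rate(rollout_answers: list[str], reference: str) -> float:
    """Fraction of roll-outs whose final answer matches the reference."""
    matches = sum(ans.strip() == reference.strip() for ans in rollout_answers)
    return matches / len(rollout_answers)


def difficulty_level(rate: float) -> int:
    """Map a solve rate to a five-level grade (hypothetical cutoffs)."""
    # A higher solve rate means an easier question, i.e. a lower level.
    for level, cutoff in enumerate([0.8, 0.6, 0.4, 0.2], start=1):
        if rate >= cutoff:
            return level
    return 5


def filter_and_grade(rollouts: dict[str, list[str]],
                     references: dict[str, str]) -> dict[str, int]:
    """Drop questions every roll-out solves; grade the remainder."""
    graded = {}
    for qid, answers in rollouts.items():
        rate = solve_rate(answers, references[qid])
        if rate == 1.0:  # "easily solved" -> filtered out
            continue
        graded[qid] = difficulty_level(rate)
    return graded
```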
arXiv Detail & Related papers (2025-06-21T09:55:42Z) - Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolboxes. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z) - SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning [89.48883747910448]
We present SeePhys, a large-scale multimodal benchmark for reasoning grounded in physics questions. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. We observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark.
arXiv Detail & Related papers (2025-05-25T11:28:34Z) - PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions [9.428916253383402]
MLLMs have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
arXiv Detail & Related papers (2025-05-21T12:48:16Z) - PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving [38.44445350202585]
We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1,297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics.
arXiv Detail & Related papers (2025-03-26T06:21:56Z) - Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [13.530403536762064]
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. We evaluate various open and closed language models on our dataset, including o3-mini, o1, DeepSeek-R1, GPT-4o, and versions of Llama and Qwen.
arXiv Detail & Related papers (2025-02-19T19:00:00Z) - Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning: physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and the experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z) - Towards a Holistic Understanding of Mathematical Questions with Contrastive Pre-training [65.10741459705739]
We propose a novel contrastive pre-training approach for mathematical question representations, namely QuesCo.
We first design two-level question augmentations, including content-level and structure-level, which generate question pairs that differ in surface form but share similar purposes.
Then, to fully exploit the hierarchical information of knowledge concepts, we propose a knowledge hierarchy-aware ranking strategy.
arXiv Detail & Related papers (2023-01-18T14:23:29Z)
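QuesCo's precise objective is defined in its paper; as general background, two-view contrastive pre-training of this kind is commonly built on an InfoNCE loss over in-batch negatives, sketched below in PyTorch. Treating the content-level and structure-level augmentations as the two views is an assumption here, and the hierarchy-aware ranking component is not shown.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(view_a: torch.Tensor, view_b: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of question embeddings of shape (batch, dim).

    Row i of view_a and row i of view_b embed two augmentations of the
    same question; all other rows in the batch act as negatives.
    """
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)   # diagonal entries are the positives
```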