Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
- URL: http://arxiv.org/abs/2510.00919v2
- Date: Thu, 02 Oct 2025 09:55:14 GMT
- Title: Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
- Authors: Shunfeng Zheng, Yudi Zhang, Meng Fang, Zihan Zhang, Zhitan Wu, Mykola Pechenizkiy, Ling Chen
- Abstract summary: Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but the capacity of these models for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers.
- Score: 56.119382216818195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but the capacity of these models for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.
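The paper's own pipeline is not reproduced here; the following is a minimal sketch of the retrieve-then-generate loop the abstract describes. The toy corpus, the `all-MiniLM-L6-v2` retriever, and the `generate` stub are illustrative assumptions, not details from PhoPile or the authors' code.

```python
"""Minimal retrieve-then-generate sketch for physics QA.

Assumptions (not from the paper): a plain-text corpus of past
solutions, a SentenceTransformer dense retriever, and a stubbed
`generate` call standing in for any LLM/LMM backend.
"""
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Toy stand-in for a corpus of past problems and solutions.
CORPUS = [
    "Energy conservation: for a frictionless incline, mgh = (1/2)mv^2.",
    "Kirchhoff's voltage law: the signed sum of voltages around any loop is zero.",
    "For projectile motion, range R = v^2 sin(2*theta) / g on flat ground.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = encoder.encode(CORPUS, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    q_emb = encoder.encode([question], normalize_embeddings=True)
    scores = corpus_emb @ q_emb[0]      # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [CORPUS[i] for i in top]

def build_prompt(question: str) -> str:
    """Prepend retrieved passages to the question, RAG-style."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Use the reference material below to solve the physics problem.\n"
        f"References:\n{context}\n\nProblem: {question}\nSolution:"
    )

def generate(prompt: str) -> str:
    """Placeholder: swap in any LLM/LMM API call here."""
    raise NotImplementedError

if __name__ == "__main__":
    print(build_prompt("A ball is launched at 20 m/s at 45 degrees; find its range."))
```

A multimodal variant would retrieve over diagram-text pairs and call an LMM backend instead; the retrieve-then-prompt structure stays the same.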
Related papers
- Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark [49.42250115889234]
We present the first benchmark designed to test large language models (LLMs) on research-level reasoning tasks. CritPt consists of 71 composite research challenges that simulate full-scale research projects at the entry level. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges.
arXiv Detail & Related papers (2025-09-30T17:34:03Z) - Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs [12.215295420714787]
This study investigates the application of advanced instruction-tuned reasoning models, such as DeepSeek-R1, to a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Not only do they achieve state-of-the-art accuracy on intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation (see the sketch below).
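To make the symbolic-versus-numerical contrast concrete, here is a small illustration of my own (not drawn from the paper): the symbolic route derives a closed-form expression before substituting numbers, while the numerical route plugs values in immediately and yields only a number.

```python
"""Symbolic vs. numerical solving of a textbook projectile problem (sympy).
Illustrative only; not tied to any benchmark problem."""
import sympy as sp

v, g, theta, t = sp.symbols("v g theta t", positive=True)
y = v * sp.sin(theta) * t - g * t**2 / 2   # vertical position of a projectile

# Symbolic route: derive a closed-form range formula, substitute numbers last.
t_flight = [r for r in sp.solve(sp.Eq(y, 0), t) if r != 0][0]  # 2*v*sin(theta)/g
R = sp.simplify(v * sp.cos(theta) * t_flight)                  # v**2*sin(2*theta)/g
print(R, "=", R.subs({v: 20, theta: sp.pi / 4, g: 9.81}).evalf())

# Numerical route: plug values in first, then solve the resulting equation.
y_num = y.subs({v: 20, g: 9.81, theta: sp.pi / 4})
t_num = [r for r in sp.solve(sp.Eq(y_num, 0), t) if r != 0][0]
print(20 * sp.cos(sp.pi / 4) * t_num)      # same range, ~40.77 m, but no formula
```

The symbolic route produces a reusable formula; the numerical route produces only a value, which is the difference in reasoning pattern the paper highlights.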
arXiv Detail & Related papers (2025-07-02T03:51:16Z) - PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced five-level difficulty grading system.
arXiv Detail & Related papers (2025-06-21T09:55:42Z) - Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and tools. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z) - SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning [95.2886065291234]
We present SeePhys, a large-scale multimodal benchmark for reasoning grounded in physics questions. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. We observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark.
arXiv Detail & Related papers (2025-05-25T11:28:34Z) - PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving [38.44445350202585]
We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1,297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics.
arXiv Detail & Related papers (2025-03-26T06:21:56Z) - Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models [8.320153035338418]
This paper explores ideas and provides a potential roadmap for the development and evaluation of physics-specific large-scale AI models. These models, built on foundation models such as Large Language Models (LLMs), are tailored to the demands of physics research.
arXiv Detail & Related papers (2025-01-09T17:11:22Z) - Enhancing LLMs for Physics Problem-Solving using Reinforcement Learning with Human-AI Feedback [33.000541253136745]
Large Language Models (LLMs) have demonstrated strong capabilities in text-based tasks but struggle with the complex reasoning required for physics problems. This paper presents a novel approach to improving LLM performance on physics questions using Reinforcement Learning with Human and Artificial Intelligence Feedback (RLHAIF).
arXiv Detail & Related papers (2024-12-06T21:17:47Z) - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points to prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.