HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?
- URL: http://arxiv.org/abs/2509.07894v4
- Date: Fri, 19 Sep 2025 16:18:35 GMT
- Title: HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?
- Authors: Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye
- Abstract summary: HiPhO is the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. It compiles the 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions. We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants.
- Score: 53.76627321546095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, the physical reasoning capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic, up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles the 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions, and covering mixed modalities, with problems ranging from text-only to diagram-based. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners, to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that, across the 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with multiple golds; closed-source reasoning MLLMs achieve 6 to 12 gold medals; and most models still fall well short of full marks. These results highlight the performance gap between open-source models and top students, the strong reasoning abilities of closed-source models, and the room that remains for improvement. HiPhO, a human-aligned Olympiad benchmark for multimodal physical reasoning, is open-source at https://github.com/SciYu/HiPhO, with a public leaderboard at https://phyarena.github.io/.
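To illustrate the human-aligned scoring described above, the sketch below shows how per-step marks for one exam might be aggregated and mapped to a medal via official cut-offs. The data structures, function names, and threshold values are illustrative assumptions, not HiPhO's actual grading code; the real evaluation follows the official marking schemes in the repository.

```python
# Illustrative sketch only: aggregates hypothetical per-step marks for one exam
# and maps the total to a medal using assumed (not official) thresholds.

from dataclasses import dataclass

@dataclass
class StepMark:
    earned: float   # marks the grader awarded for this step
    maximum: float  # marks available for this step in the marking scheme

def exam_score(marks: list[StepMark]) -> float:
    """Sum the step-level marks awarded across all problems of one exam."""
    return sum(m.earned for m in marks)

def assign_medal(total: float, thresholds: dict[str, float]) -> str:
    """Map a total score to a medal using per-exam cut-offs (highest first)."""
    for medal in ("gold", "silver", "bronze"):
        if total >= thresholds[medal]:
            return medal
    return "no medal"

# Hypothetical example: the thresholds and marks below are made up for illustration.
thresholds = {"gold": 23.0, "silver": 15.0, "bronze": 10.0}  # assumed cut-offs
marks = [StepMark(2.0, 2.0), StepMark(1.5, 3.0), StepMark(4.0, 5.0)]
print(assign_medal(exam_score(marks), thresholds))  # -> "no medal" (7.5 points)
```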
Related papers
- P1: Mastering Physics Olympiads with Reinforcement Learning [84.08897284032724]
We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL).
P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025.
P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions.
arXiv Detail & Related papers (2025-11-17T17:18:13Z)
- PhysicsMinions: Winning Gold Medals in the Latest Physics Olympiads with a Coevolutionary Multimodal Multi-Agent System [65.02248709992442]
Physics is central to understanding and shaping the real world, and the ability to solve physics problems is a key indicator of real-world physical intelligence.
Existing approaches are predominantly single-model based, and open-source MLLMs rarely reach gold-medal-level performance.
We propose PhysicsMinions, a coevolutionary multi-agent system for Physics Olympiad.
Its architecture features three synergistic studios: a Visual Studio to interpret diagrams, a Logic Studio to formulate solutions, and a Review Studio to perform dual-stage verification.
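As a rough illustration of the three-studio design summarized above, the hedged sketch below wires together placeholder stages for diagram interpretation, solution drafting, and review. All function names and the accept/revise logic are assumptions made for illustration, not the PhysicsMinions implementation, which the cited paper describes.

```python
# Hypothetical sketch of a Visual -> Logic -> Review pipeline with revision rounds.
# Stage names follow the abstract; the internals are placeholders, not the real system.

def visual_studio(problem: dict) -> str:
    """Interpret any diagram and return a textual description (placeholder)."""
    return problem.get("diagram_description", "no diagram")

def logic_studio(statement: str, notes: str) -> str:
    """Draft a candidate solution from the problem text and diagram notes."""
    return f"Draft solution for: {statement} (using {notes})"

def review_studio(solution: str) -> tuple[bool, str]:
    """Verification placeholder: check the draft and return (accepted, feedback)."""
    ok = "Draft solution" in solution  # stand-in for a real consistency check
    return ok, "looks consistent" if ok else "needs revision"

def solve(problem: dict, max_rounds: int = 3) -> str:
    """Iterate draft-and-review until the reviewer accepts or rounds run out."""
    notes = visual_studio(problem)
    solution = logic_studio(problem["statement"], notes)
    for _ in range(max_rounds):
        accepted, feedback = review_studio(solution)
        if accepted:
            return solution
        solution = logic_studio(problem["statement"], f"{notes}; reviewer: {feedback}")
    return solution

print(solve({"statement": "A block slides down a frictionless incline...",
             "diagram_description": "incline at 30 degrees"}))
```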
arXiv Detail & Related papers (2025-09-29T14:40:53Z)
- EEFSUVA: A New Mathematical Olympiad Benchmark [1.7589620883907298]
We examine claims that large language models (LLMs) reach gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks.
We introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and countries of the former Soviet Union.
Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks.
arXiv Detail & Related papers (2025-09-23T01:57:56Z)
- RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning [26.173204350710833]
RIMO is a two-track benchmark designed to preserve peak Olympiad difficulty while eliminating evaluation noise.
The first track, RIMO-N, rewrites 335 problems to admit a single, unique integer answer, allowing for deterministic correctness checking.
The second track, RIMO-P, features 456 proof problems with expert-checked solutions, which are decomposed into a sequence of sub-problems to evaluate the step-by-step reasoning process.
arXiv Detail & Related papers (2025-09-09T13:13:51Z)
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments.
CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
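The "human-comparable Elo ratings" in this entry refer to the rating system used on competitive-programming platforms. The sketch below shows the textbook Elo expected-score and update step as a point of reference; it is not CodeElo's exact rating procedure, which the cited paper specifies.

```python
# Textbook Elo rating update (illustrative; not CodeElo's exact procedure).
# A model and a human contestant are treated as two players of one "match".

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after a match (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# Example: a 1500-rated model beats a 1600-rated contestant.
print(round(update(1500, 1600, 1.0), 1))  # -> 1520.5, a gain of roughly 20 points
```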
arXiv Detail & Related papers (2025-01-02T13:49:00Z)
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level.
Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation.
Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, reaching only 60.54% and 52.55% accuracy, respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)