Perfect score on IPhO 2025 theory by Gemini agent
- URL: http://arxiv.org/abs/2603.03352v1
- Date: Thu, 26 Feb 2026 18:53:05 GMT
- Title: Perfect score on IPhO 2025 theory by Gemini agent
- Authors: Yichen Huang
- Abstract summary: The International Physics Olympiad (IPhO) is the world's most prestigious and renowned physics competition for pre-university students. On the IPhO 2025 theory problems, gold-medal performance by AI models was reported previously, but it fell short of the best human contestant. Here we build a simple agent with Gemini 3.1 Pro Preview.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The International Physics Olympiad (IPhO) is the world's most prestigious and renowned physics competition for pre-university students. IPhO problems require complex reasoning based on a deep understanding of the physical principles in a standard general physics curriculum. On the IPhO 2025 theory problems, gold-medal performance by AI models was reported previously, but it fell short of the best human contestant. Here we build a simple agent with Gemini 3.1 Pro Preview. We ran it five times, and it achieved a perfect score every time. However, data contamination cannot be ruled out, because Gemini 3.1 Pro Preview was released after the competition.
Related papers
- P1: Mastering Physics Olympiads with Reinforcement Learning [84.08897284032724]
We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). P1-235B-A22B is the first open-source model with gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-235B-A22B+PhysicsMinions achieves overall No. 1 on IPhO 2025 and obtains the highest average score over the 13 physics competitions.
arXiv Detail & Related papers (2025-11-17T17:18:13Z)
- LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025 [3.5580730009417016]
We introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor.
arXiv Detail & Related papers (2025-11-13T17:20:46Z)
- PhysicsMinions: Winning Gold Medals in the Latest Physics Olympiads with a Coevolutionary Multimodal Multi-Agent System [65.02248709992442]
Physics is central to understanding and shaping the real world, and the ability to solve physics problems is a key indicator of real-world physical intelligence. Existing approaches are predominantly single-model based, and open-source MLLMs rarely reach gold-medal-level performance. We propose PhysicsMinions, a coevolutionary multi-agent system for Physics Olympiads. Its architecture features three synergistic studios: a Visual Studio to interpret diagrams, a Logic Studio to formulate solutions, and a Review Studio to perform dual-stage verification.
arXiv Detail & Related papers (2025-09-29T14:40:53Z)
- HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark? [53.76627321546095]
HiPhO is the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. It compiles the 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions. We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants.
arXiv Detail & Related papers (2025-09-09T16:24:51Z)
- Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025 [55.8464246603186]
We introduce Physics Supernova, an AI system with superior physics problem-solving abilities. Supernova attains 23.5/30 points, ranking 14th of 406 contestants and surpassing the median performance of human gold medalists. These results show that principled tool integration within agent systems can deliver competitive improvements.
arXiv Detail & Related papers (2025-09-01T17:59:13Z)
- Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline [10.177917426690703]
Large language models often struggle with Olympiad-level problems. We construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025.
arXiv Detail & Related papers (2025-07-21T17:59:49Z)
- PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
arXiv Detail & Related papers (2025-06-21T09:55:42Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.