Perfect score on IPhO 2025 theory by Gemini agent
- URL: http://arxiv.org/abs/2603.03352v1
- Date: Thu, 26 Feb 2026 18:53:05 GMT
- Title: Perfect score on IPhO 2025 theory by Gemini agent
- Authors: Yichen Huang
- Abstract summary: The International Physics Olympiad (IPhO) is the world's most prestigious and renowned physics competition for pre-university students. On the IPhO 2025 theory problems, gold-medal performance by AI models was reported previously, but it fell short of the best human contestant. Here we build a simple agent with Gemini 3.1 Pro Preview.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The International Physics Olympiad (IPhO) is the world's most prestigious and renowned physics competition for pre-university students. IPhO problems require complex reasoning based on a deep understanding of the physical principles in a standard general physics curriculum. On the IPhO 2025 theory problems, gold-medal performance by AI models was reported previously, but it fell short of the best human contestant. Here we build a simple agent with Gemini 3.1 Pro Preview. We ran it five times, and it achieved a perfect score every time. However, data contamination cannot be ruled out, because Gemini 3.1 Pro Preview was released after the competition.
Related papers
- P1: Mastering Physics Olympiads with Reinforcement Learning [84.08897284032724]
We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). P1-235B-A22B is the first open-source model with gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-235B-A22B+PhysicsMinions achieves overall No. 1 on IPhO 2025 and obtains the highest average score over the 13 physics competitions.
arXiv Detail & Related papers (2025-11-17T17:18:13Z)
- LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025 [3.5580730009417016]
We introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor.
arXiv Detail & Related papers (2025-11-13T17:20:46Z)
- PhysicsMinions: Winning Gold Medals in the Latest Physics Olympiads with a Coevolutionary Multimodal Multi-Agent System [65.02248709992442]
Physics is central to understanding and shaping the real world, and the ability to solve physics problems is a key indicator of real-world physical intelligence. Existing approaches are predominantly single-model based, and open-source MLLMs rarely reach gold-medal-level performance. We propose PhysicsMinions, a coevolutionary multi-agent system for Physics Olympiads. Its architecture features three synergistic studios: a Visual Studio to interpret diagrams, a Logic Studio to formulate solutions, and a Review Studio to perform dual-stage verification.
arXiv Detail & Related papers (2025-09-29T14:40:53Z)
- HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark? [53.76627321546095]
HiPhO is the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. It compiles the 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions. We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants.
arXiv Detail & Related papers (2025-09-09T16:24:51Z)
- Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025 [55.8464246603186]
We introduce Physics Supernova, an AI system with superior physics problem-solving abilities. Supernova attains 23.5/30 points, ranking 14th of 406 contestants and surpassing the median performance of human gold medalists. These results show that principled tool integration within agent systems can deliver competitive improvements.
arXiv Detail & Related papers (2025-09-01T17:59:13Z)
- Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline [10.177917426690703]
Large language models often struggle with Olympiad-level problems. We construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025.
arXiv Detail & Related papers (2025-07-21T17:59:49Z)
- PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
arXiv Detail & Related papers (2025-06-21T09:55:42Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.