Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
- URL: http://arxiv.org/abs/2507.15855v4
- Date: Tue, 30 Sep 2025 17:53:21 GMT
- Title: Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
- Authors: Yichen Huang, Lin F. Yang
- Abstract summary: Large language models often struggle with Olympiad-level problems. We construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025.
- Score: 10.177917426690703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The International Mathematical Olympiad (IMO) is widely regarded as the world championship of high-school mathematics. IMO problems are renowned for their difficulty and novelty, demanding deep insight, creativity, and rigor. Although large language models perform well on many mathematical benchmarks, they often struggle with Olympiad-level problems. Using carefully designed prompts, we construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025, avoiding data contamination for models released before the competition. Equipped with any of the three leading models -- Gemini 2.5 Pro, Grok-4, or GPT-5 -- our pipeline correctly solved 5 out of the 6 problems ($\approx$85.7% accuracy). This is in sharp contrast to their baseline accuracies: 31.6% (Gemini 2.5 Pro), 21.4% (Grok-4), and 38.1% (GPT-5), obtained by selecting the best of 32 candidate solutions. The substantial improvement underscores that the path to advanced AI reasoning requires not only developing more powerful base models but also designing effective methodologies to harness their full potential for complex tasks.
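The abstract describes a generate-verify-refine loop built purely from prompts, plus a best-of-32 sampling baseline. The sketch below illustrates the general shape such a pipeline could take; the `ask_model` interface, the prompt wording, and the round limit are illustrative assumptions, not the authors' actual prompts or parameters.

```python
# Minimal sketch of a model-agnostic verification-and-refinement loop.
# `ask_model` is a hypothetical stand-in for any chat-completion API
# (Gemini 2.5 Pro, Grok-4, GPT-5, ...); prompts and limits are illustrative.

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the underlying model and return its reply."""
    raise NotImplementedError

def solve_with_verification(problem: str, max_rounds: int = 10) -> str | None:
    # Step 1: draft an initial solution.
    solution = ask_model(f"Solve this IMO problem with a complete, rigorous proof:\n{problem}")
    for _ in range(max_rounds):
        # Step 2: have the model act as a strict grader of the current solution.
        report = ask_model(
            "Act as a strict IMO grader. List every gap or error in the solution below, "
            "or reply 'ACCEPT' if it is fully rigorous.\n\n"
            f"Problem:\n{problem}\n\nSolution:\n{solution}"
        )
        if "ACCEPT" in report:
            return solution  # solution passed verification
        # Step 3: refine the solution using the verifier's feedback.
        solution = ask_model(
            f"Revise the solution to address these issues:\n{report}\n\n"
            f"Problem:\n{problem}\n\nPrevious solution:\n{solution}"
        )
    return None  # no solution survived verification within the round budget
```

By contrast, the baseline figures quoted in the abstract come from simply sampling 32 independent candidate solutions per problem and selecting the best one, with no verification feedback.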
Related papers
- mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models [2.0467354053171243]
We introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025). Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters.
arXiv Detail & Related papers (2025-11-12T13:52:37Z) - Towards Robust Mathematical Reasoning [41.319782208621156]
We present IMO-Bench, a suite of advanced reasoning benchmarks vetted by a panel of top specialists. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities.
arXiv Detail & Related papers (2025-11-03T18:53:02Z) - BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models [13.380359214677176]
We introduce BeyondBench, an evaluation framework that avoids contamination from internet-scale training data. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels. We evaluate 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters.
arXiv Detail & Related papers (2025-09-29T02:49:01Z) - WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning [51.13280433665446]
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning.
arXiv Detail & Related papers (2025-09-27T09:58:03Z) - Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving [36.20164235042574]
In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin.
arXiv Detail & Related papers (2025-07-31T17:00:30Z) - Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving [48.22540519786074]
Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench. We propose a novel framework that decouples high-level reasoning from low-level proof generation. We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success.
arXiv Detail & Related papers (2025-07-07T22:38:49Z) - LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) now outperform elite humans in competitive programming. We revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z) - Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad [0.8030359871216614]
We evaluate state-of-the-art reasoning models on six problems from the 2025 USAMO. Only Gemini-2.5-Pro achieves a non-trivial score of 25%. Our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-27T19:21:05Z) - Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z) - PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z) - UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models [11.964085209696051]
UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated in UGMathBench. Our evaluation of 23 leading LLMs reveals that the highest EAcc robustness achieved is 56.3% by OpenAI-o1-mini, with large $\Delta$ values observed across different models.
arXiv Detail & Related papers (2025-01-23T15:46:43Z) - HARP: A challenging human-annotated math reasoning benchmark [7.691786865279827]
We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy). These problems span six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written solutions per problem.
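Since the HARP abstract highlights answers that can be checked automatically with libraries such as SymPy, here is a hedged sketch of what such a check could look like; the equivalence logic is an assumption for illustration, not HARP's actual grader.

```python
# Illustrative SymPy-based answer check: two closed-form answers are treated
# as equivalent if their symbolic difference simplifies to zero. This is an
# assumed example, not the checker shipped with HARP.
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def answers_match(predicted: str, reference: str) -> bool:
    try:
        return simplify(parse_expr(predicted) - parse_expr(reference)) == 0
    except Exception:
        # Fall back to exact string comparison if either answer fails to parse.
        return predicted.strip() == reference.strip()

print(answers_match("2*sqrt(2)", "sqrt(8)"))  # True: both equal 2*sqrt(2)
```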
arXiv Detail & Related papers (2024-12-11T23:31:06Z) - ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z) - Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving only 60.54% and 52.55% accuracy respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z) - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our detailed analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z) - An In-depth Look at Gemini's Language Abilities [49.897870833250494]
We compare the abilities of the OpenAI GPT and Google Gemini models.
We perform this analysis over 10 datasets testing a variety of language abilities.
We find that Gemini Pro achieves accuracy that is close to, but slightly below, that of the corresponding GPT 3.5 Turbo model.
arXiv Detail & Related papers (2023-12-18T18:47:42Z) - MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning [60.208045804204076]
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving.
The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset.
arXiv Detail & Related papers (2023-09-11T17:47:22Z) - Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models [23.344490944210456]
We present JEEBench, a more challenging benchmark dataset for evaluating the problem solving abilities of large language models (LLMs).
We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%.
arXiv Detail & Related papers (2023-05-24T11:55:59Z)