Related papers: AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

URL: http://arxiv.org/abs/2510.26768v1
Date: Thu, 30 Oct 2025 17:52:02 GMT
Title: AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
Authors: Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, Shuang Zhou,
Abstract summary: AMO-Bench is an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty.<n>Each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation.<n> Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy.
Score: 22.16740349046417
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/

Related papers

RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning [26.173204350710833]
RIMO is a two-track benchmark designed to preserve peak Olympiad difficulty while eliminating evaluation noise.<n>The first track, RIMO-N, rewrites 335 problems to admit a single, unique integer answer, allowing for deterministic correctness checking.<n>The second track, RIMO-P, features 456 proof problems with expert-checked solutions, which are decomposed into a sequence of sub-problems to evaluate the step-by-step reasoning process.
arXiv Detail & Related papers (2025-09-09T13:13:51Z)
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) now outperform elite humans in competitive programming.<n>We revisit this claim, examining how LLMs differ from human experts and where limitations still remain.<n>We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI.<n>A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z)
MathArena: Evaluating LLMs on Uncontaminated Math Competitions [4.655668424508813]
MathArena is a new benchmark for evaluating large language models (LLMs)<n>It is based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems.<n>MathArena is also the first benchmark for proof-writing capabilities.
arXiv Detail & Related papers (2025-05-29T09:28:06Z)
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad [4.573289946657861]
We evaluate reasoning models on six problems from the 2025 USAMO.<n>Only Gemini-2.5-Pro achieves a non-trivial score of 25%.<n>Our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-27T19:21:05Z)
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.<n>OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level.<n>Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation.<n>Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions. The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics. Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models [23.344490944210456]
We present 515Bench, a more challenging benchmark dataset for evaluating the problem solving abilities of large language models (LLMs) We curate challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT-Advanced exam. Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%.
arXiv Detail & Related papers (2023-05-24T11:55:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.