Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions
- URL: http://arxiv.org/abs/2512.00097v1
- Date: Thu, 27 Nov 2025 01:05:00 GMT
- Title: Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions
- Authors: Boyan Duan, Xiao Liang, Shuai Lu, Yaoxiang Wang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Mao Yang, Weizhu Chen, Yeyun Gong
- Abstract summary: We present a highly efficient method for geometry theorem proving that runs entirely on CPUs without relying on neural network-based inference. Our initial study shows that a simple random strategy for adding auxiliary points can achieve silver-medal-level human performance on the International Mathematical Olympiad (IMO). We further construct HAGeo-409, a benchmark consisting of 409 geometry problems with human-assessed difficulty levels.
- Score: 129.877899436804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated theorem proving in Euclidean geometry, particularly for International Mathematical Olympiad (IMO) level problems, remains a major challenge and an important research focus in Artificial Intelligence. In this paper, we present a highly efficient method for geometry theorem proving that runs entirely on CPUs without relying on neural network-based inference. Our initial study shows that a simple random strategy for adding auxiliary points can achieve silver-medal level human performance on IMO. Building on this, we propose HAGeo, a Heuristic-based method for adding Auxiliary constructions in Geometric deduction that solves 28 of 30 problems on the IMO-30 benchmark, achieving gold-medal level performance and surpassing AlphaGeometry, a competitive neural network-based approach, by a notable margin. To evaluate our method and existing approaches more comprehensively, we further construct HAGeo-409, a benchmark consisting of 409 geometry problems with human-assessed difficulty levels. Compared with the widely used IMO-30, our benchmark poses greater challenges and provides a more precise evaluation, setting a higher bar for geometry theorem proving.
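To make the abstract's two-stage idea concrete, below is a minimal Python sketch of a random auxiliary-construction loop wrapped around a forward-chaining deduction closure. The fact encoding, the rule interface, and the midpoint sampler are illustrative assumptions, not HAGeo's actual engine.

```python
import random

# Minimal sketch of the abstract's two-stage idea: a cheap forward-chaining
# deduction closure, restarted after each randomly sampled auxiliary point.
# The fact encoding, rule interface, and midpoint sampler are illustrative
# assumptions, not HAGeo's actual engine.

def deduction_closure(facts, rules):
    """Apply every rule until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:                  # rule: set[fact] -> iterable[fact]
            for fact in list(rule(facts)):
                if fact not in facts:
                    facts.add(fact)
                    changed = True
    return facts

def sample_auxiliary(points, rng):
    """Randomly propose an auxiliary construction (here: a midpoint)."""
    a, b = rng.sample(sorted(points), 2)
    name = f"mid_{a}{b}"
    return name, ("midpoint", name, a, b)

def prove(goal, facts, points, rules, budget=1000, seed=0):
    """Alternate symbolic deduction with random auxiliary constructions."""
    rng, points = random.Random(seed), set(points)
    facts = deduction_closure(facts, rules)
    for _ in range(budget):
        if goal in facts:
            return True
        name, construction = sample_auxiliary(points, rng)
        points.add(name)
        facts = deduction_closure(facts | {construction}, rules)
    return goal in facts
```

The point of the sketch is the control flow: the deduction closure is cheap enough to rerun after every sampled construction, which is what lets even a purely random sampler reach a nontrivial solve rate; HAGeo's heuristics can be read as replacing `sample_auxiliary` with an informed proposer.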
Related papers
- NoReGeo: Non-Reasoning Geometry Benchmark [5.288175082601994]
NoReGeo is a novel benchmark designed to evaluate the intrinsic geometric understanding of large language models (LLMs). Our benchmark comprises 2,500 trivial geometric problems spanning 25 categories, each carefully crafted to be solvable purely through native geometric understanding. We assess a range of state-of-the-art models on NoReGeo, including frontier models like GPT-4, observing that even the most advanced systems achieve an overall maximum of 65% accuracy in binary classification tasks.
arXiv Detail & Related papers (2026-01-15T10:22:55Z)
- GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation [48.04396968707237]
We present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving. We systematically assess capabilities ranging from attribute extraction to logical error correction. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
arXiv Detail & Related papers (2025-12-30T09:56:37Z)
- Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning [66.79506488139707]
Large language model (LLM) agents exhibit strong mathematical problem-solving abilities. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes LLMs' limitations in geometry by iteratively proposing propositions and auxiliary constructions and verifying them with a symbolic engine. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems, exceeding the average gold medalist score (40.9), using only 13K training examples.
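The propose-and-verify loop this summary describes can be pictured with a small, hypothetical sketch; `llm_propose` and `symbolic_check` below are placeholder interfaces, not InternGeometry's actual components.

```python
# Hypothetical sketch of a propose-then-verify agent loop. `llm_propose`
# and `symbolic_check` are placeholder interfaces, not InternGeometry's
# actual components.

def agent_loop(problem, llm_propose, symbolic_check, max_rounds=20):
    facts = set(problem["hypotheses"])
    for _ in range(max_rounds):
        # The LLM suggests a proposition or an auxiliary construction.
        proposal = llm_propose(problem, facts)
        valid, derived = symbolic_check(facts, proposal)
        if valid:
            facts |= derived               # keep only machine-verified steps
            if problem["goal"] in facts:
                return facts               # proof closed
        # Rejected proposals are simply discarded and the agent tries again.
    return None
```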
arXiv Detail & Related papers (2025-12-11T11:05:04Z)
- GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs [7.605833826892782]
We present a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning.
arXiv Detail & Related papers (2025-05-23T09:17:07Z)
- PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
- Proposing and solving olympiad geometry with guided tree search [63.824930029019995]
We introduce TongGeometry, a Euclidean geometry system supporting tree-search-based guided problem proposing and solving. TongGeometry discovers 6.7 billion geometry theorems requiring auxiliary constructions, including 4.1 billion exhibiting geometric symmetry. TongGeometry solved all International Mathematical Olympiad geometry problems in IMO-AG-30, outperforming gold medalists for the first time.
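As a rough illustration of tree-search-based guided solving, here is a minimal best-first search skeleton; the `expand` and `score` callbacks stand in for the system's learned guidance and are assumptions, not TongGeometry's implementation.

```python
import heapq

# Minimal best-first skeleton for guided tree search; `expand` and `score`
# stand in for learned guidance and are assumptions, not TongGeometry's code.

def guided_tree_search(root, expand, score, is_solved, max_nodes=10_000):
    """Always expand the proof state the guidance scores highest."""
    tie = 0  # tie-breaker so heapq never compares states directly
    frontier = [(-score(root), tie, root)]
    expanded = 0
    while frontier and expanded < max_nodes:
        _, _, state = heapq.heappop(frontier)
        expanded += 1
        if is_solved(state):
            return state
        for child in expand(state):   # e.g. one auxiliary construction each
            tie += 1
            heapq.heappush(frontier, (-score(child), tie, child))
    return None
```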
arXiv Detail & Related papers (2024-12-14T04:20:47Z)
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4,428 competition-level problems with rigorous human annotation. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving 60.54% and 52.55% accuracy respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry [16.41436428888792]
We revisit the IMO-AG-30 Challenge introduced with AlphaGeometry and find that Wu's method is surprisingly strong.
Wu's method alone can solve 15 problems, some of which are not solved by any of the other methods.
We set a new state of the art for automated theorem proving on IMO-AG-30, solving 27 of 30 problems, the first AI method to outperform an IMO gold medalist.
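Wu's method decides geometry statements algebraically: hypotheses and the conclusion are translated into polynomials, and the conclusion is checked by successive pseudo-division against a characteristic set of the hypotheses. The toy example below follows the same translate-and-reduce pattern but substitutes a Groebner basis reduction (readily available in SymPy) for the characteristic-set machinery; it verifies that the midpoint of the hypotenuse of a right triangle is equidistant from two of its vertices.

```python
from sympy import symbols, groebner, expand

# Toy algebraic proof in the style of Wu's method, with a Groebner basis
# reduction standing in for characteristic sets and pseudo-division.
# Right angle at A=(0,0), B=(b,0), C=(0,c); M=(mx,my) is the midpoint of BC.
b, c, mx, my = symbols("b c mx my")

hypotheses = [2*mx - b,   # mx = b/2
              2*my - c]   # my = c/2

# Conclusion |MA|^2 - |MB|^2 = 0, expressed as a polynomial that should
# reduce to zero under the hypotheses.
conclusion = (mx**2 + my**2) - ((mx - b)**2 + my**2)

G = groebner(hypotheses, mx, my, b, c, order="lex")
_, remainder = G.reduce(expand(conclusion))
print(remainder)  # 0, so the conclusion follows from the hypotheses
```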
arXiv Detail & Related papers (2024-04-09T15:54:00Z)
- FGeo-TP: A Language Model-Enhanced Solver for Geometry Problems [1.137457877869062]
We introduce FGeo-TP (Theorem Predictor), which utilizes a language model to predict theorem sequences for solving geometry problems.
Our results demonstrate a significant increase in the problem-solving rate of the language model-enhanced FGeo-TP on the FormalGeo7k dataset.
arXiv Detail & Related papers (2024-02-14T09:44:28Z)
- GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning [172.36214872466707]
We focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge.
We propose GeoQA, a Geometric Question Answering dataset containing 5,010 geometric problems with corresponding annotated programs.
arXiv Detail & Related papers (2021-05-30T12:34:17Z)
- Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning [123.06420835072225]
We construct a new large-scale benchmark, Geometry3K, consisting of 3,002 geometry problems with dense annotation in formal language.
We propose a novel geometry solving approach with formal language and symbolic reasoning, called Interpretable Geometry Problem Solver (Inter-GPS).
Inter-GPS incorporates theorem knowledge as conditional rules and performs symbolic reasoning step by step.
arXiv Detail & Related papers (2021-05-10T07:46:55Z)
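Taking "theorem knowledge as conditional rules" literally, a forward-chaining reading of the Inter-GPS summary looks like the sketch below; the rule encoding and the two sample theorems are illustrative assumptions, not Inter-GPS's actual rule base.

```python
# Hypothetical encoding of theorems as (premises, conclusion) rules, applied
# step by step with each applied theorem recorded as a proof step.
RULES = [
    ({"right_triangle(A,B,C)"}, "angle(B) + angle(C) = 90"),
    ({"isosceles(A,B,C)"}, "angle(B) = angle(C)"),
]

def solve(facts, goal, rules=RULES):
    """Forward-chain conditional rules until the goal appears (or nothing fires)."""
    facts, trace = set(facts), []
    changed = True
    while changed and goal not in facts:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                trace.append((premises, conclusion))
                changed = True
    return trace if goal in facts else None

print(solve({"isosceles(A,B,C)"}, "angle(B) = angle(C)"))
# [({'isosceles(A,B,C)'}, 'angle(B) = angle(C)')]
```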