MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
- URL: http://arxiv.org/abs/2109.00110v1
- Date: Tue, 31 Aug 2021 23:21:12 GMT
- Title: MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
- Authors: Kunhao Zheng, Jesse Michael Han, Stanislas Polu
- Abstract summary: miniF2F is a dataset of formal Olympiad-level mathematics problem statements intended to provide a unified cross-system benchmark for neural theorem proving.
The miniF2F benchmark currently targets Metamath, Lean, and Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO).
We report baseline results using GPT-f, a neural theorem prover based on GPT-3, and provide an analysis of its performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present miniF2F, a dataset of formal Olympiad-level mathematics problem
statements intended to provide a unified cross-system benchmark for neural
theorem proving. The miniF2F benchmark currently targets Metamath, Lean, and
Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and
the International Mathematical Olympiad (IMO), as well as material from
high-school and undergraduate mathematics courses. We report baseline results
using GPT-f, a neural theorem prover based on GPT-3, and provide an analysis of
its performance. We intend for miniF2F to be a community-driven effort and hope
that our benchmark will help spur advances in neural theorem proving.
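To make the shape of the benchmark concrete, here is a minimal sketch, in Lean 4 with Mathlib, of the kind of statement miniF2F contains. The problem and the name amc_style_example are invented for illustration and are not taken from the dataset; the benchmark itself ships Metamath, Lean, and Isabelle versions of each statement.

    import Mathlib

    -- Illustrative only: an AMC-style algebra fact posed as a theorem statement.
    -- The benchmark fixes the statement; producing the proof (here a one-line
    -- linarith call) is the task of the neural theorem prover under evaluation.
    theorem amc_style_example (x : ℝ) (h : 2 * x + 3 = 7) : x = 2 := by
      linarith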
Related papers
- LeanConjecturer: Automatic Generation of Mathematical Conjectures for Theorem Proving [6.220998637943786]
We introduce LeanConjecturer, a pipeline for automatically generating university-level mathematical conjectures in Lean 4 using Large Language Models (LLMs). Through iterative generation and evaluation, LeanConjecturer produced 12,289 conjectures from 40 Mathlib seed files, with 3,776 identified as syntactically valid and non-trivial.
arXiv Detail & Related papers (2025-06-27T08:17:18Z)
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.
OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.
For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.
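(A minimal Lean sketch of this rewrite-style mutation appears after the related-papers list below.)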
As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z)
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level.
Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics.
Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [71.46993852662021]
We present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B.
Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities.
arXiv Detail & Related papers (2024-09-18T16:45:37Z)
- SubgoalXL: Subgoal-based Expert Learning for Theorem Proving [37.115856591703974]
SubgoalXL is a novel approach that synergizes subgoal-based proofs with expert learning to enhance formal theorem proving.
SubgoalXL achieves a new state-of-the-art performance of 56.1% in Isabelle on the standard miniF2F dataset.
arXiv Detail & Related papers (2024-08-20T20:10:53Z)
- PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition [6.971356163199923]
PutnamBench consists of 1692 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition.
PutnamBench requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses.
arXiv Detail & Related papers (2024-07-15T19:57:15Z)
- MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
- ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models [67.32868432113587]
This paper introduces ConceptMath, a fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs).
Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts.
arXiv Detail & Related papers (2024-02-22T16:06:49Z)
- FIMO: A Challenge Formal Dataset for Automated Theorem Proving [31.695624833932577]
FIMO is designed to facilitate advanced automated theorem proving at the IMO level.
It comprises 149 formal problem statements, accompanied by both informal problem descriptions and their corresponding LaTeX-based informal proofs.
arXiv Detail & Related papers (2023-09-08T12:34:28Z)
- Generative Language Modeling for Automated Theorem Proving [94.01137612934842]
This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans might be addressable via generation from language models.
We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance.
arXiv Detail & Related papers (2020-09-07T19:50:10Z)
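As a rough illustration of the symbolic mutation described in the Alchemy entry above, the following Lean 4 / Mathlib sketch is invented for this summary and is not taken from the paper or its code: starting from the Mathlib lemma Nat.add_comm, rewriting one side backwards with Nat.zero_add yields a new, still-provable variant of the kind Alchemy synthesizes at scale.

    import Mathlib

    -- Hypothetical "mutated" theorem: the statement of Nat.add_comm
    -- (a + b = b + a) with its right-hand side wrapped via Nat.zero_add
    -- (0 + n = n). The proof undoes the rewrite and applies the original lemma.
    theorem add_comm_mutated (a b : ℕ) : a + b = 0 + (b + a) := by
      rw [Nat.zero_add]
      exact Nat.add_comm a b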
This list is automatically generated from the titles and abstracts of the papers on this site.