MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
- URL: http://arxiv.org/abs/2109.00110v1
- Date: Tue, 31 Aug 2021 23:21:12 GMT
- Title: MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
- Authors: Kunhao Zheng, Jesse Michael Han, Stanislas Polu
- Abstract summary: miniF2F is a dataset of formal Olympiad-level mathematics problem statements intended to provide a unified cross-system benchmark for neural theorem proving.
The miniF2F benchmark currently targets Metamath, Lean, and Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO).
We report baseline results using GPT-f, a neural theorem prover based on GPT-3, and provide an analysis of its performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present miniF2F, a dataset of formal Olympiad-level mathematics problem
statements intended to provide a unified cross-system benchmark for neural
theorem proving. The miniF2F benchmark currently targets Metamath, Lean, and
Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and
the International Mathematical Olympiad (IMO), as well as material from
high-school and undergraduate mathematics courses. We report baseline results
using GPT-f, a neural theorem prover based on GPT-3, and provide an analysis of
its performance. We intend for miniF2F to be a community-driven effort and hope
that our benchmark will help spur advances in neural theorem proving.
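To illustrate what a formal Olympiad-level problem statement looks like, the following is a hypothetical Lean sketch in the style of a miniF2F entry (not an actual problem from the benchmark). Each benchmark entry states the problem as a theorem whose proof is left open for the prover to find; `sorry` marks the missing proof:

```lean
-- Hypothetical miniF2F-style statement (assumed example, not from the
-- dataset): given a + b = 2 and a * b = 1, show that a^2 + b^2 = 2.
-- The proof is intentionally omitted; a theorem prover must supply it.
theorem amc_style_example (a b : ℝ) (h₁ : a + b = 2) (h₂ : a * b = 1) :
    a^2 + b^2 = 2 :=
sorry
```

The identity follows from a^2 + b^2 = (a + b)^2 - 2ab = 4 - 2 = 2; the point of the benchmark is that the system, not a human, must produce such a proof in each target language (Metamath, Lean, or Isabelle).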
Related papers
- PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition [6.971356163199923]
PutnamBench consists of 1697 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition.
All the theorems have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations.
We use PutnamBench to evaluate several established neural and symbolic theorem-provers.
arXiv Detail & Related papers (2024-07-15T19:57:15Z) - A Critical Analysis of the Theoretical Framework of the Extreme Learning Machine [1.9503475832401784]
We refute the proofs of two main statements and provide a dataset that serves as a counterexample to the ELM learning algorithm.
We provide alternative statements of the foundations, which justify the efficiency of ELM in some theoretical cases.
arXiv Detail & Related papers (2024-06-25T10:06:07Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z) - ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models [67.32868432113587]
This paper introduces ConceptMath, a fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs).
Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts.
arXiv Detail & Related papers (2024-02-22T16:06:49Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Llemma: An Open Language Model For Mathematics [46.557804525919785]
We present Llemma, a large language model for mathematics.
On the MATH benchmark, Llemma outperforms all known open base models.
Llemma is capable of tool use and formal theorem proving without any further finetuning.
arXiv Detail & Related papers (2023-10-16T17:54:07Z) - FIMO: A Challenge Formal Dataset for Automated Theorem Proving [31.695624833932577]
FIMO is designed to facilitate advanced automated theorem proving at the IMO level.
It comprises 149 formal problem statements, accompanied by both informal problem descriptions and their corresponding LaTeX-based informal proofs.
arXiv Detail & Related papers (2023-09-08T12:34:28Z) - ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics [7.607254619341369]
We introduce ProofNet, a benchmark for autoformalization and formal proving of undergraduate-level mathematics.
The ProofNet benchmark consists of 371 examples, each comprising a formal theorem statement in Lean 3.
We report baseline results on statement autoformalization via in-context learning.
arXiv Detail & Related papers (2023-02-24T03:28:46Z) - Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling [83.77955213766896]
Graph convolutional networks (GCNs) have recently achieved great empirical success in learning graph-structured data.
To address their scalability issue, graph topology sampling has been proposed to reduce the memory and computational cost of training GCNs.
This paper provides the first theoretical justification of graph topology sampling in training (up to) three-layer GCNs.
arXiv Detail & Related papers (2022-07-07T21:25:55Z) - Generative Language Modeling for Automated Theorem Proving [94.01137612934842]
This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans might be addressable via generation from language models.
We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance.
arXiv Detail & Related papers (2020-09-07T19:50:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.