PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
- URL: http://arxiv.org/abs/2407.11214v2
- Date: Sun, 3 Nov 2024 17:14:37 GMT
- Title: PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
- Authors: George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, Swarat Chaudhuri,
- Abstract summary: PutnamBench consists of 1692 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition.
PutnamBench requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses.
- Score: 6.971356163199923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present PutnamBench, a new multi-language benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems. PutnamBench consists of 1692 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America. All the problems have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations. PutnamBench requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We use PutnamBench to evaluate several established neural and symbolic theorem-provers. These approaches can only solve a handful of the PutnamBench problems, establishing the benchmark as a difficult open challenge for research on neural theorem-proving. PutnamBench is available at https://github.com/trishullab/PutnamBench.
Related papers
- CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics [45.32204834230928]
We introduce CombiBench, a comprehensive benchmark comprising 100 problems, each formalized in Lean4 and paired with its corresponding informal statement.<n> CombiBench is suitable for testing solving problems since 2000 (except IMO 2004 P3 as its statement contain an images)<n>We also provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval, for formal mathematics.
arXiv Detail & Related papers (2025-05-06T04:32:17Z) - Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.
OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z) - PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems.
The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction.
Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z) - An unholy trinity: TFNP, polynomial systems, and the quantum satisfiability problem [0.0]
We define two new subclasses of Total Function NP (TFNP) based on the study of complex systems.
At the heart of our study is the computational problem known as Quantum SAT with a System of Distinct Representatives (SDR)
We show how to embed the roots of a sparse, high-degree complexity into QSAT with SDR, obtaining that SFTA is contained in a zero-error version of SDR.
arXiv Detail & Related papers (2024-12-27T12:57:06Z) - FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [2.0608396919601493]
FrontierMath is a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians.
Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community.
As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
arXiv Detail & Related papers (2024-11-07T17:07:35Z) - Another quantum version of Sanov theorem [53.64687146666141]
We study how to extend Sanov theorem to the quantum setting.
We propose another quantum version of Sanov theorem by considering the quantum analog of the empirical distribution.
arXiv Detail & Related papers (2024-07-26T07:46:30Z) - MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z) - Rigor with Machine Learning from Field Theory to the Poincar\'e
Conjecture [0.0]
We discuss techniques for obtaining rigor in the natural sciences with machine learning.
Non-rigorous methods may lead to rigorous results via conjecture generation or verification by reinforcement learning.
One can also imagine building direct bridges between machine learning theory and either mathematics or theoretical physics.
arXiv Detail & Related papers (2024-02-20T19:00:59Z) - TheoremQA: A Theorem-driven Question Answering dataset [100.39878559382694]
GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting.
TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems.
arXiv Detail & Related papers (2023-05-21T17:51:35Z) - Connes implies Tsirelson: a simple proof [91.3755431537592]
We show that the Connes embedding problem implies the synchronous Tsirelson conjecture.
We also give a different construction of Connes' algebra $mathcalRomega$ appearing in the Connes embedding problem.
arXiv Detail & Related papers (2022-09-16T13:59:42Z) - JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem
Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model(PLM)
Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols and formulas in the problem statement.
We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
arXiv Detail & Related papers (2022-06-13T17:03:52Z) - MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics [0.0]
miniF2F is a dataset of formal Olympiad-level mathematics problems statements intended to provide a unified cross-system benchmark for neural theorem proving.
The miniF2F benchmark currently targets Metamath, Lean, and Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO)
We report baseline results using GPT-f, a neural theorem prover based on GPT-3 and provide an analysis of its performance.
arXiv Detail & Related papers (2021-08-31T23:21:12Z) - A refinement of Reznick's Positivstellensatz with applications to
quantum information theory [72.8349503901712]
In Hilbert's 17th problem Artin showed that any positive definite in several variables can be written as the quotient of two sums of squares.
Reznick showed that the denominator in Artin's result can always be chosen as an $N$-th power of the squared norm of the variables.
arXiv Detail & Related papers (2019-09-04T11:46:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.