Mathematics with large language models as provers and verifiers
- URL: http://arxiv.org/abs/2510.12829v3
- Date: Thu, 06 Nov 2025 09:23:35 GMT
- Title: Mathematics with large language models as provers and verifiers
- Authors: Hieu Le Duc, Leo Liberti,
- Abstract summary: ChatGPT solves five out of six IMO problems, and close about a third of the sixty-six number theory conjectures in [Cohen, Journal of Sequences, 2025].
- Score: 1.1029146548022293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove it. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the lean proof assistant, and the conformance of premises and conclusion of the lean code is verified by a human. Our methodology is by no means complete or exact. It was nonetheless able to solve five out of six 2025 IMO problems, and close about a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].
Related papers
- Gödel Test: Can Large Language Models Solve Easy Conjectures? [40.906606632144694]
We propose the G"odel Test: evaluating whether a model can produce correct proofs for very simple, previously unsolved conjectures.<n>We study the performance of GPT-5 on five conjectures in algorithm optimization.<n>GPT-5 may represent an early step toward frontier models eventually passing the G"odel Test.
arXiv Detail & Related papers (2025-09-22T20:11:40Z) - Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs [41.29431283264807]
We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems.<n>We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning.<n>Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges.
arXiv Detail & Related papers (2025-08-21T14:15:40Z) - Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification [56.218970738892764]
Chain-of-Thought prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs)<n>To mitigate hallucinations in CoT that are notoriously difficult to detect, current methods operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness.<n>We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations.
arXiv Detail & Related papers (2025-06-05T03:16:08Z) - Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions [37.10426226729792]
We introduce the LLMe-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating pattern-driven conjecturing with formal theorem proving.<n>We present ConstructiveBench, a dataset of 3,431 answer-Thought problems in various math competitions with verified Lean formalizations.
arXiv Detail & Related papers (2025-05-24T03:52:25Z) - LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts the progress in the proof.<n>Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z) - Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs [6.65877320351217]
We developed LeanNavigator, a novel method for generating a large-scale dataset of Lean theorems and proofs.<n>We generated 4.7 million theorems totaling 1 billion tokens, surpassing previous datasets by more than an order of magnitude.<n>Using this extensive dataset, we trained an AI model that outperforms the state-of-the-art ReProver model in theorem-proving tasks.
arXiv Detail & Related papers (2025-02-16T06:20:39Z) - Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.<n>For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.<n>As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z) - Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.<n>Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - Towards a Mathematics Formalisation Assistant using Large Language
Models [5.485439959027125]
We explore the abilities of a large language model (Codex) to help with formalisation in the Lean theorem prover.
Codex is able to formalise short mathematical statements at undergrad level with nearly 75% accuracy for $120$ theorem statements.
We show that with a new prompting strategy Codex can formalise these proofs in natural language with at least one out of twelve Codex completion being easy to repair into a complete proof.
arXiv Detail & Related papers (2022-11-14T16:52:32Z) - PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.