Hilbert: Recursively Building Formal Proofs with Informal Reasoning
- URL: http://arxiv.org/abs/2509.22819v1
- Date: Fri, 26 Sep 2025 18:24:23 GMT
- Title: Hilbert: Recursively Building Formal Proofs with Informal Reasoning
- Authors: Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, Ke Ye,
- Abstract summary: Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions often contain errors that cannot be automatically verified.<n>We introduce Hilbert, an agentic framework that combines the complementary strengths of informal reasoning and formal verification.<n>Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever.
- Score: 38.36481253622752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6% points above the best publicly available method. Hilbert achieves the best known result on PutnamBench. It solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation.
Related papers
- HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs [32.234133057592935]
Hermes is a tool-assisted agent that interleaves informal reasoning with verified proof steps in Lean systems.<n>We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales.
arXiv Detail & Related papers (2025-11-24T04:50:18Z) - FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory [44.64175433092553]
Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving.<n>Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored.<n>We introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning.
arXiv Detail & Related papers (2025-09-26T14:40:14Z) - Solving Inequality Proofs with Large Language Models [46.71658812761115]
Inequality proving is crucial across diverse scientific and mathematical fields.<n>This makes it a demanding frontier for large language models (LLMs)<n>We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z) - Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification [56.218970738892764]
Chain-of-Thought prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs)<n>To mitigate hallucinations in CoT that are notoriously difficult to detect, current methods operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness.<n>We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations.
arXiv Detail & Related papers (2025-06-05T03:16:08Z) - APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning [8.056359341994941]
APOLLO is a modelagnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities.<n>A set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean.<n>The repaired subproofs are recombined and reverified, iterating up to a usercontrolled maximum number of attempts.
arXiv Detail & Related papers (2025-05-09T03:38:31Z) - LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts the progress in the proof.<n>Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z) - Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source language model that achieves state-of-the-art (as of April 5 2025) performance in automated formal proof generation for mathematical problems.<n>First, we train LLMs to convert natural language math problems from the Numina dataset to equivalent formal statements in Lean 4.<n>Next, we develop a large dataset of formal proofs by training a series of provers.<n>Finally, we obtain the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1
arXiv Detail & Related papers (2025-02-11T15:27:35Z) - Proof Automation with Large Language Models [6.587933406842906]
Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language.
We propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems.
Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems.
arXiv Detail & Related papers (2024-09-22T00:19:27Z) - TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts [26.98890165420689]
TheoremLlama is an end-to-end framework that trains a general-purpose Lean4 expert.
Our framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively.
arXiv Detail & Related papers (2024-07-03T15:36:18Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - LEGO-Prover: Neural Theorem Proving with Growing Libraries [86.1191481712352]
We present LEGO-Prover, which employs a growing skill library containing verified lemmas as skills to augment the capability of LLMs used in theorem proving.
By constructing the proof modularly, LEGO-Prover enables LLMs to utilize existing skills retrieved from the library and to create new skills during the proving process.
Our ablation study indicates that these newly added skills are indeed helpful for proving theorems, resulting in an improvement from a success rate of 47.1% to 50.4%.
arXiv Detail & Related papers (2023-10-01T12:47:59Z) - LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean toolkit consisting of toolkits, data, models.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.