Related papers: ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings

ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings

URL: http://arxiv.org/abs/2510.15681v1
Date: Fri, 17 Oct 2025 14:20:50 GMT
Title: ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings
Authors: Prithwish Jana, Kaan Kale, Ahmet Ege Tanriverdi, Cruise Song, Sriram Vishwanath, Vijay Ganesh,
Abstract summary: We present ProofBridge, a framework for automatically translating entire NL theorems and proofs into Lean 4.<n>At its core is a joint embedding model that aligns NL and FL (NL-FL) theorem-proof pairs in a shared semantic space.<n>Our training ensures NL-FL theorems are mapped close together in this space if and only if the NL-FL pairs are semantically equivalent.
Score: 9.764411884491052
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Translating human-written mathematical theorems and proofs from natural language (NL) into formal languages (FLs) like Lean 4 has long been a significant challenge for AI. Most state-of-the-art methods address this separately, first translating theorems and then generating proofs, creating a fundamental disconnect vis-a-vis true proof auto-formalization. This two-step process and its limitations were evident even in AlphaProof's silver-medal performance at the 2024 IMO, where problem statements needed manual translation before automated proof synthesis. We present ProofBridge, a unified framework for automatically translating entire NL theorems and proofs into Lean 4. At its core is a joint embedding model that aligns NL and FL (NL-FL) theorem-proof pairs in a shared semantic space, enabling cross-modal retrieval of semantically relevant FL examples to guide translation. Our training ensures that NL-FL theorems (and their proofs) are mapped close together in this space if and only if the NL-FL pairs are semantically equivalent. ProofBridge integrates retrieval-augmented fine-tuning with iterative proof repair, leveraging Lean's type checker and semantic equivalence feedback to ensure both syntactic correctness and semantic fidelity. Experiments show substantial improvements in proof auto-formalization over strong baselines (including GPT-5, Gemini-2.5, Kimina-Prover, DeepSeek-Prover), with our retrieval-augmented approach yielding significant gains in semantic correctness (SC, via proving bi-directional equivalence) and type correctness (TC, via type-checking theorem+proof) across pass@k metrics on miniF2F-Test-PF, a dataset we curated. In particular, ProofBridge improves cross-modal retrieval quality by up to 3.28x Recall@1 over all-MiniLM-L6-v2, and achieves +31.14% SC and +1.64% TC (pass@32) compared to the baseline Kimina-Prover-RL-1.7B.

Related papers

PBLean: Pseudo-Boolean Proof Certificates for Lean 4 [27.126691338850254]
We present PBLean, a method for importing VeriPB pseudo-Boolean (PB) proof certificates into Lean 4.<n>Key to our approach is reflection: a Boolean checker function whose soundness is fully proved in Lean and executed as compiled native code.<n>Our checker supports all VeriPB kernel rules, including cutting-planes and proof-by-contradiction subproofs.
arXiv Detail & Related papers (2026-02-09T14:13:30Z)
BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs.<n>We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z)
ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization [10.021930888250546]
Current approaches fail to preserve semantic meaning and logical structure of the original human-written argument.<n>We introduce ProofFlow, a novel pipeline that treats structural fidelity as a primary objective.<n>We present a new benchmark of 184 undergraduate-level problems, manually annotated with step-by-step solutions.<n>We also introduce ProofScore, a new composite metric to evaluate syntactic correctness, semantic faithfulness, and structural fidelity.
arXiv Detail & Related papers (2025-10-13T10:20:11Z)
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification [56.218970738892764]
Chain-of-Thought prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs)<n>To mitigate hallucinations in CoT that are notoriously difficult to detect, current methods operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness.<n>We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations.
arXiv Detail & Related papers (2025-06-05T03:16:08Z)
Hierarchical Attention Generates Better Proofs [8.676187819105298]
We introduce textbfHierarchical Attention, a regularization method that aligns attention mechanisms with mathematical reasoning structures.<n>Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation.
arXiv Detail & Related papers (2025-04-27T10:35:05Z)
Automated Proof Generation for Rust Code via Self-Evolution [69.25795662658356]
We introduce SAFE, a framework that overcomes the lack of human-written snippets to enable automated proof generation of Rust code.<n> SAFE re-purposes the large number of synthesized incorrect proofs to train the self-ging capability of the fine-tuned models.<n>We achieve a 52.52% accuracy rate in a benchmark crafted by human experts, a significant leap over GPT-4o's performance of 14.39%.
arXiv Detail & Related papers (2024-10-21T08:15:45Z)
Proof Automation with Large Language Models [6.587933406842906]
Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language. We propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems. Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems.
arXiv Detail & Related papers (2024-09-22T00:19:27Z)
Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.<n>Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
Improving Autoformalization using Type Checking [15.58948808529849]
We analyze both current autoformalization methods and the processes used to evaluate them, focusing specifically on the Lean 4 theorem proving language.<n>We demonstrate that scaling type-check filtering with self-consistency techniques on top of existing methods significantly improves performance, achieving absolute accuracy gains of up to +18.4% on ProofNet.<n>We also release new benchmarks: a new research-level mathematics dataset RLM25, a corrected ProofNet, and ProofNetVerif with labeled correct and incorrect autoformalization pairs for evaluating metrics.
arXiv Detail & Related papers (2024-06-11T13:01:50Z)
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs. Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z)
Generating Natural Language Proofs with Verifier-Guided Search [74.9614610172561]
We present a novel stepwise method NLProofS (Natural Language Proof Search) NLProofS learns to generate relevant steps conditioning on the hypothesis. It achieves state-of-the-art performance on EntailmentBank and RuleTaker.
arXiv Detail & Related papers (2022-05-25T02:22:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.