ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization
- URL: http://arxiv.org/abs/2510.15981v1
- Date: Mon, 13 Oct 2025 10:20:11 GMT
- Title: ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization
- Authors: Rafael Cabral, Tuan Manh Do, Xuejun Yu, Wai Ming Tai, Zijin Feng, Xin Shen,
- Abstract summary: Current approaches fail to preserve the semantic meaning and logical structure of the original human-written argument. We introduce ProofFlow, a novel pipeline that treats structural fidelity as a primary objective. We present a new benchmark of 184 undergraduate-level problems, manually annotated with step-by-step solutions. We also introduce ProofScore, a new composite metric to evaluate syntactic correctness, semantic faithfulness, and structural fidelity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proof autoformalization, the task of translating natural language theorems and proofs into machine-verifiable code, is a critical step for integrating large language models into rigorous mathematical workflows. Current approaches focus on producing executable code, but they frequently fail to preserve the semantic meaning and logical structure of the original human-written argument. To address this, we introduce ProofFlow, a novel pipeline that treats structural fidelity as a primary objective. ProofFlow first constructs a directed acyclic graph (DAG) to map the logical dependencies between proof steps. Then, it employs a novel lemma-based approach to systematically formalize each step as an intermediate lemma, preserving the logical structure of the original argument. To facilitate evaluation, we present a new benchmark of 184 undergraduate-level problems, manually annotated with step-by-step solutions and logical dependency graphs, and introduce ProofScore, a new composite metric to evaluate syntactic correctness, semantic faithfulness, and structural fidelity. Experimental results show our pipeline sets a new state-of-the-art for autoformalization, achieving a ProofScore of 0.545, substantially exceeding baselines like full-proof formalization (0.123), which processes the entire proof at once, and step-proof formalization (0.072), which handles each step independently. Our pipeline, benchmark, and score metric are open-sourced to encourage further progress at https://github.com/Huawei-AI4Math/ProofFlow.
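The DAG construction described in the abstract can be illustrated with a short sketch. The code below is a minimal illustration under stated assumptions, not ProofFlow's actual implementation: each proof step is a node, edges record logical dependencies, and a topological order guarantees that every step can be formalized as an intermediate lemma only after the lemmas it relies on are available. The step names and dependency data are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical proof steps mapped to the steps they logically depend on
# (a directed acyclic graph, as in ProofFlow's dependency analysis).
steps = {
    "s1": set(),          # base fact, no prerequisites
    "s2": {"s1"},         # derived from s1
    "s3": {"s1"},         # also derived from s1
    "s4": {"s2", "s3"},   # combines s2 and s3 into the conclusion
}

def lemma_order(deps):
    """Return an order in which each step can be formalized as an
    intermediate lemma, so that every lemma's hypotheses are
    conclusions of lemmas proved earlier in the sequence."""
    return list(TopologicalSorter(deps).static_order())

order = lemma_order(steps)

# Sanity check: every step appears after all of its dependencies.
assert all(order.index(d) < order.index(s)
           for s, ds in steps.items() for d in ds)
print(order)
```

A cycle in the dependency map would indicate a circular argument; `TopologicalSorter` raises `graphlib.CycleError` in that case, which is a natural validity check for an extracted proof graph.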
Related papers
- Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement [66.51979814832332]
The model formulates procedural graph extraction as a multi-round reasoning process with dedicated structural and logical refinement. Experiments demonstrate substantial improvements in both structural correctness and logical consistency over strong baselines.
arXiv Detail & Related papers (2026-01-27T04:00:48Z) - Training LLMs with LogicReward for Faithful and Rigorous Reasoning [75.30425553246177]
We propose LogicReward, a reward system that guides model training by enforcing step-level logical correctness with a theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks.
arXiv Detail & Related papers (2025-12-20T03:43:02Z) - Translating Informal Proofs into Formal Proofs Using a Chain of States [20.0011959667642]
We address the problem of translating informal mathematical proofs expressed in natural language into formal proofs in Lean 4 under a constrained computational budget. We first extract a Chain of States (CoS), a sequence of intermediate formal proof states aligned with the logical structure of the informal argument. We then generate tactics to transition between adjacent states in the CoS, thereby constructing the full formal proof.
arXiv Detail & Related papers (2025-12-11T06:08:34Z) - BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs. We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z) - Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z) - ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings [9.764411884491052]
We present ProofBridge, a framework for automatically translating entire NL theorems and proofs into Lean 4. At its core is a joint embedding model that aligns NL and FL (NL-FL) theorem-proof pairs in a shared semantic space. Our training ensures NL-FL theorems are mapped close together in this space if and only if the NL-FL pairs are semantically equivalent.
arXiv Detail & Related papers (2025-10-17T14:20:50Z) - Typed Chain-of-Thought: A Curry-Howard Framework for Verifying LLM Reasoning [0.0]
Chain-of-Thought (CoT) prompting enhances the reasoning capabilities of large language models. We propose a novel theoretical lens for this problem grounded in the Curry-Howard correspondence. We operationalise this analogy, presenting methods to extract and map the informal, natural language steps of CoT into a formal, typed proof structure.
arXiv Detail & Related papers (2025-10-01T16:06:40Z) - StepProof: Step-by-step verification of natural language mathematical proofs [16.150265021594088]
We propose StepProof, a novel autoformalization method for granular, step-by-step verification. StepProof breaks down complete proofs into multiple verifiable subproofs, enabling sentence-level verification. We show that StepProof significantly improves proof success rates and efficiency compared to traditional methods.
arXiv Detail & Related papers (2025-06-12T10:31:23Z) - ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a procedure that equips LLMs with automation methods at various granularities. Our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. We also implement a Lean 4 version of ProofAug that improves the pass@1 performance of Kimina-Prover-Preview-Distill-1.5B from 44.3% to 50.4%.
arXiv Detail & Related papers (2025-01-30T12:37:06Z) - Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation. For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it. As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z) - Improving Autoformalization using Type Checking [15.58948808529849]
We analyze both current autoformalization methods and the processes used to evaluate them, focusing specifically on the Lean 4 theorem proving language. We demonstrate that scaling type-check filtering with self-consistency techniques on top of existing methods significantly improves performance, achieving absolute accuracy gains of up to +18.4% on ProofNet. We also release new benchmarks: a new research-level mathematics dataset RLM25, a corrected ProofNet, and ProofNetVerif with labeled correct and incorrect autoformalization pairs for evaluating metrics.
arXiv Detail & Related papers (2024-06-11T13:01:50Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Generating Natural Language Proofs with Verifier-Guided Search [74.9614610172561]
We present a novel stepwise method, NLProofS (Natural Language Proof Search).
NLProofS learns to generate relevant steps conditioning on the hypothesis.
It achieves state-of-the-art performance on EntailmentBank and RuleTaker.
arXiv Detail & Related papers (2022-05-25T02:22:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.