Autoformalizer with Tool Feedback
- URL: http://arxiv.org/abs/2510.06857v1
- Date: Wed, 08 Oct 2025 10:25:12 GMT
- Title: Autoformalizer with Tool Feedback
- Authors: Qi Guo, Jianing Wang, Jianfei Zhang, Deyang Kong, Xiangzhou Huang, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye,
- Abstract summary: Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements.<n>Existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency.<n>We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
- Score: 52.334957386319864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Efforts in recent work shift from directly prompting large language models to training an end-to-end formalizer model from scratch, achieving remarkable advancements. However, existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency. To address this issue, we propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process. By integrating Lean 4 compilers for syntax corrections and employing a multi-LLMs-as-judge approach for consistency validation, the model is able to adaptively refine generated statements according to the tool feedback, enhancing both syntactic validity and semantic consistency. The training of ATF involves a cold-start phase on synthetic tool-calling data, an expert iteration phase to improve formalization capabilities, and Direct Preference Optimization to alleviate ineffective revisions. Experimental results show that ATF markedly outperforms a range of baseline formalizer models, with its superior performance further validated by human evaluations. Subsequent analysis reveals that ATF demonstrates excellent inference scaling properties. Moreover, we open-source Numina-ATF, a dataset containing 750K synthetic formal statements to facilitate advancements in autoformalization and ATP research.
Related papers
- Automatic Syntax Error Repair for Discrete Controller Synthesis using Large Language Model [8.741815980649667]
This paper presents an automated approach that leverages Large Language Models (LLMs) to repair syntax errors in DCS models.<n>It equips the LLM with DCS-specific domain knowledge, including formal grammar rules and illustrative examples, to guide accurate corrections.<n>It achieves a speedup of 3.46 times compared to human developers.
arXiv Detail & Related papers (2025-12-08T07:57:15Z) - SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification [1.740313383876245]
We introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification.<n>Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions.
arXiv Detail & Related papers (2025-11-12T18:36:59Z) - ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization [73.0780809974414]
We propose a Reflective Autoformalization method that integrates semantic consistency evaluation into the autoformalization process.<n>This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors.<n>Experiments show that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines.
arXiv Detail & Related papers (2025-10-28T16:22:54Z) - PAT-Agent: Autoformalization for Model Checking [17.082027022913998]
PAT-Agent is an end-to-end framework for natural language autoformalization and formal model repair.<n>It combines the generative capabilities of large language models with the rigor of formal verification.
arXiv Detail & Related papers (2025-09-28T06:32:14Z) - Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [68.00108157244952]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable.<n>A promising yet largely uncharted alternative is formal language-based reasoning.<n>Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z) - Supporting Software Formal Verification with Large Language Models: An Experimental Study [9.688989142858954]
SpecVerify integrates large language models with formal verification tools.<n>This framework combines Claude 3.5 Sonnet with the ESBMC verifier to form an automated workflow.<n> Evaluated on nine cyber-physical systems from Lockheed Martin, SpecVerify achieves 46.5% verification accuracy, comparable to NASA's CoCoSim.
arXiv Detail & Related papers (2025-07-07T10:30:05Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Improving Autoformalization using Type Checking [15.58948808529849]
We analyze both current autoformalization methods and the processes used to evaluate them, focusing specifically on the Lean 4 theorem proving language.<n>We demonstrate that scaling type-check filtering with self-consistency techniques on top of existing methods significantly improves performance, achieving absolute accuracy gains of up to +18.4% on ProofNet.<n>We also release new benchmarks: a new research-level mathematics dataset RLM25, a corrected ProofNet, and ProofNetVerif with labeled correct and incorrect autoformalization pairs for evaluating metrics.
arXiv Detail & Related papers (2024-06-11T13:01:50Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.<n>Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs)<n>We propose textbfCost-textbfEfficient textbfLanguage Model textbfAlignment (textbfCELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - Prompting to Distill: Boosting Data-Free Knowledge Distillation via
Reinforced Prompt [52.6946016535059]
Data-free knowledge distillation (DFKD) conducts knowledge distillation via eliminating the dependence of original training data.
We propose a prompt-based method, termed as PromptDFD, that allows us to take advantage of learned language priors.
As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance.
arXiv Detail & Related papers (2022-05-16T08:56:53Z) - NoiER: An Approach for Training more Reliable Fine-TunedDownstream Task
Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.