Improving Autoformalization using Type Checking
- URL: http://arxiv.org/abs/2406.07222v1
- Date: Tue, 11 Jun 2024 13:01:50 GMT
- Title: Improving Autoformalization using Type Checking
- Authors: Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut
- Abstract summary: Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages.
The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved using Codex for Lean 3, only showed successful formalization of 16.1% of informal statements.
Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check.
- Score: 15.58948808529849
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages. However, current autoformalization methods remain limited. The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved using Codex for Lean 3, only showed successful formalization of 16.1% of informal statements. Similarly, our evaluation of GPT-4o for Lean 4 only produces successful translations 34.9% of the time. Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check (i.e., are syntactically correct and consistent with types) - with a whopping 86.6% of GPT-4o errors starting from a type-check failure. In this work, we propose a method to fix this issue through decoding with type-check filtering, where we initially sample a diverse set of candidate formalizations for an informal statement, then use the Lean proof assistant to filter out candidates that do not type-check. Using GPT-4o as a base model, and combining our method with self-consistency, we obtain a +18.3% absolute increase in formalization accuracy, and achieve a new state-of-the-art of 53.2% on ProofNet with Lean 4.
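The decoding pipeline described in the abstract — sample diverse candidate formalizations, discard those that fail Lean's type-check, then apply self-consistency over the survivors — can be sketched as a small selection function. This is an illustrative sketch, not the authors' implementation: the `type_checks` predicate is an injected stand-in for an actual call into the Lean 4 proof assistant, and majority voting is used as a simple proxy for self-consistency.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

def select_formalization(
    candidates: Iterable[str],
    type_checks: Callable[[str], bool],
) -> Optional[str]:
    """Pick a formalization by type-check filtering + self-consistency.

    1. Filter: keep only candidates that the proof assistant accepts
       (syntactically correct and type-consistent).
    2. Vote: among the survivors, return the most frequently sampled
       candidate (a simple majority-vote form of self-consistency).
    Returns None when no candidate type-checks.
    """
    survivors = [c for c in candidates if type_checks(c)]
    if not survivors:
        return None
    # most_common(1) yields the modal candidate; ties are broken by
    # first occurrence in the sample order.
    return Counter(survivors).most_common(1)[0][0]
```

In a real system, `type_checks` would write the candidate statement (with its proof elided, e.g. via `sorry`) to a Lean 4 file and invoke the Lean toolchain, treating a clean elaboration as success; the exact invocation and any semantic notion of candidate equivalence for voting are assumptions here, not details from the paper.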
Related papers
- Automated Proof Generation for Rust Code via Self-Evolution [69.25795662658356]
We introduce SAFE, a novel framework that overcomes the lack of human-written proof to enable automated proof generation of Rust code.
We demonstrate superior efficiency and precision compared to GPT-4o.
This advancement leads to a significant improvement in performance, achieving a 70.50% accuracy rate in a benchmark crafted by human experts.
arXiv Detail & Related papers (2024-10-21T08:15:45Z) - FormalAlign: Automated Alignment Evaluation for Autoformalization [12.674957231226871]
FormalAlign trains on both the autoformalization sequence generation task and the representational alignment between input and output.
Evaluated across four benchmarks, FormalAlign demonstrates superior performance.
This effective alignment evaluation significantly reduces the need for manual verification.
arXiv Detail & Related papers (2024-10-14T03:58:35Z) - Process-Driven Autoformalization in Lean 4 [30.056591518828554]
We develop a benchmark to evaluate the autoformalization capabilities of large language models.
We also introduce a model that leverages the precise feedback from Lean 4 compilers to enhance autoformalization.
Our experiments demonstrate that the PSV method improves autoformalization, enabling higher accuracy using less filtered training data.
arXiv Detail & Related papers (2024-06-04T03:48:08Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - Calibrating the Confidence of Large Language Models by Eliciting Fidelity [52.47397325111864]
Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless.
Post-alignment, these language models often exhibit overconfidence, where the expressed confidence does not accurately calibrate with their correctness rate.
We propose a plug-and-play method to estimate the confidence of language models.
arXiv Detail & Related papers (2024-04-03T11:36:12Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - Towards AI-Assisted Synthesis of Verified Dafny Methods [1.0187122752343796]
Existing large language models show a severe lack of proficiency in verified programming.
We show how to improve two pretrained models' proficiency in the Dafny verification-aware language.
arXiv Detail & Related papers (2024-02-01T00:07:23Z) - Benchmarking and Improving Generator-Validator Consistency of Language Models [82.73914625520686]
Inconsistency between generating and validating an answer is prevalent in language models (LMs).
Even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time.
We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%.
arXiv Detail & Related papers (2023-10-03T07:23:22Z) - Autoformalization with Large Language Models [22.86710743804944]
A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence.
We show large language models provide new prospects towards this goal.
Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from 29.6% to 35.2%.
arXiv Detail & Related papers (2022-05-25T09:53:30Z) - TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.