Improving Autoformalization using Type Checking
- URL: http://arxiv.org/abs/2406.07222v1
- Date: Tue, 11 Jun 2024 13:01:50 GMT
- Title: Improving Autoformalization using Type Checking
- Authors: Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut
- Abstract summary: Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages.
The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved using Codex for Lean 3, only showed successful formalization of 16.1% of informal statements.
Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check.
- Score: 15.58948808529849
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages. However, current autoformalization methods remain limited. The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved using Codex for Lean 3, only showed successful formalization of 16.1% of informal statements. Similarly, our evaluation of GPT-4o for Lean 4 only produces successful translations 34.9% of the time. Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check (i.e., are syntactically correct and consistent with types) - with a whopping 86.6% of GPT-4o errors starting from a type-check failure. In this work, we propose a method to fix this issue through decoding with type-check filtering, where we initially sample a diverse set of candidate formalizations for an informal statement, then use the Lean proof assistant to filter out candidates that do not type-check. Using GPT-4o as a base model, and combining our method with self-consistency, we obtain a +18.3% absolute increase in formalization accuracy, and achieve a new state-of-the-art of 53.2% on ProofNet with Lean 4.
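The decoding pipeline described in the abstract — sample diverse candidate formalizations, discard those that fail Lean's type-check, then apply self-consistency over the survivors — can be sketched as a small selection function. This is an illustrative sketch, not the authors' implementation: the `type_checks` predicate is an injected stand-in for an actual call into the Lean 4 proof assistant, and majority voting is used as a simple proxy for self-consistency.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

def select_formalization(
    candidates: Iterable[str],
    type_checks: Callable[[str], bool],
) -> Optional[str]:
    """Pick a formalization by type-check filtering + self-consistency.

    1. Filter: keep only candidates that the proof assistant accepts
       (syntactically correct and type-consistent).
    2. Vote: among the survivors, return the most frequently sampled
       candidate (a simple majority-vote form of self-consistency).
    Returns None when no candidate type-checks.
    """
    survivors = [c for c in candidates if type_checks(c)]
    if not survivors:
        return None
    # most_common(1) yields the modal candidate; ties are broken by
    # first occurrence in the sample order.
    return Counter(survivors).most_common(1)[0][0]
```

In a real system, `type_checks` would write the candidate statement (with its proof elided, e.g. via `sorry`) to a Lean 4 file and invoke the Lean toolchain, treating a clean elaboration as success; the exact invocation and any semantic notion of candidate equivalence for voting are assumptions here, not details from the paper.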
Related papers
- Automated Proof Generation for Rust Code via Self-Evolution [69.25795662658356]
We introduce SAFE, a novel framework that overcomes the lack of human-written proof to enable automated proof generation of Rust code.
We demonstrate superior efficiency and precision compared to GPT-4o.
This advancement leads to a significant improvement in performance, achieving a 70.50% accuracy rate in a benchmark crafted by human experts.
arXiv Detail & Related papers (2024-10-21T08:15:45Z) - FormalAlign: Automated Alignment Evaluation for Autoformalization [12.674957231226871]
FormalAlign trains on both the autoformalization sequence generation task and the representational alignment between input and output.
Evaluated across four benchmarks, FormalAlign demonstrates superior performance.
This effective alignment evaluation significantly reduces the need for manual verification.
arXiv Detail & Related papers (2024-10-14T03:58:35Z) - Process-Driven Autoformalization in Lean 4 [30.056591518828554]
We develop a benchmark to evaluate the autoformalization capabilities of large language models.
We also introduce a model that leverages the precise feedback from Lean 4 compilers to enhance autoformalization.
Our experiments demonstrate that the PSV method improves autoformalization, enabling higher accuracy using less filtered training data.
arXiv Detail & Related papers (2024-06-04T03:48:08Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - Calibrating the Confidence of Large Language Models by Eliciting Fidelity [52.47397325111864]
Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless.
Post-alignment, these language models often exhibit overconfidence, where the expressed confidence does not accurately calibrate with their correctness rate.
We propose a plug-and-play method to estimate the confidence of language models.
arXiv Detail & Related papers (2024-04-03T11:36:12Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - Towards AI-Assisted Synthesis of Verified Dafny Methods [1.0187122752343796]
Existing large language models show a severe lack of proficiency in verified programming.
We show how to improve two pretrained models' proficiency in the Dafny verification-aware language.
arXiv Detail & Related papers (2024-02-01T00:07:23Z) - Benchmarking and Improving Generator-Validator Consistency of Language Models [82.73914625520686]
Inconsistency between generating and validating an answer is prevalent in language models (LMs).
Even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time.
We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%.
arXiv Detail & Related papers (2023-10-03T07:23:22Z) - Autoformalization with Large Language Models [22.86710743804944]
A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence.
We show large language models provide new prospects towards this goal.
Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from 29.6% to 35.2%.
arXiv Detail & Related papers (2022-05-25T09:53:30Z) - TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.