Towards AI-Assisted Synthesis of Verified Dafny Methods
- URL: http://arxiv.org/abs/2402.00247v2
- Date: Mon, 10 Jun 2024 20:14:41 GMT
- Title: Towards AI-Assisted Synthesis of Verified Dafny Methods
- Authors: Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, James Noble
- Abstract summary: Existing large language models show a severe lack of proficiency in verified programming.
We show how to improve two pretrained models' proficiency in the Dafny verification-aware language.
- Score: 1.0187122752343796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code, so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases; and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes retrieval-augmented example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks, and that both models perform best with the retrieval-augmented CoT prompt. With it, GPT-4 was able to generate verified, human-evaluated Dafny methods for 58% of the problems, whereas it managed only 19% of the problems with the Contextless prompt, and even fewer (10%) with the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems: 50 that we wrote manually and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code-generating large language models.
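To make the target concrete, here is a minimal sketch of the kind of verified Dafny method the study asks models to synthesize. The task and all names are illustrative MBPP-style inventions, not taken from the paper's dataset; the requires/ensures clauses play the role of the specification, and Dafny's verifier proves the body correct against them.

```dafny
// Illustrative MBPP-style task (not from the paper's dataset): return the
// largest element of a non-empty sequence. The requires/ensures clauses are
// the specification; Dafny's verifier proves the body correct against them.
method MaxOfSeq(s: seq<int>) returns (m: int)
  requires |s| > 0
  ensures exists i :: 0 <= i < |s| && s[i] == m
  ensures forall i :: 0 <= i < |s| ==> s[i] <= m
{
  m := s[0];
  var idx := 1;
  while idx < |s|
    invariant 1 <= idx <= |s|
    invariant exists i :: 0 <= i < idx && s[i] == m
    invariant forall i :: 0 <= i < idx ==> s[i] <= m
  {
    if s[idx] > m {
      m := s[idx];
    }
    idx := idx + 1;
  }
}
```

In the paper's terms, a Signature prompt would supply roughly the signature line plus a few test cases (e.g., that MaxOfSeq([1, 5, 3]) returns 5), leaving the model to produce the specification, loop invariants, and body.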
Related papers
- Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity [0.5370906227996627]
We present the first work on synthetic data augmentation for project-level proof-oriented programming, covering both generation and repair.
Our method addresses data scarcity by synthesizing basic proof-oriented programming problems to build proficiency in the language.
We show that our fine-tuned 14B-parameter model, PoPilot, outperforms GPT-4o in project-level proof-oriented programming.
arXiv Detail & Related papers (2025-02-17T15:24:11Z) - Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source large language model (LLM) that achieves state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems.
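Goedel-Prover generates whole proofs in Lean 4. Purely as an illustration of the kind of statement-and-proof pair such prover models produce (this example is ours, not drawn from the paper):

```lean
-- Illustrative statement-and-proof pair of the kind automated prover
-- models generate; this example is not from Goedel-Prover's outputs.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```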
arXiv Detail & Related papers (2025-02-11T15:27:35Z) - Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming [8.34623776815378]
We curate a dataset of 600K lines of open-source F* programs and proofs.
This dataset includes software used in production systems ranging from Windows and Linux to Python and Firefox.
We investigate the use of AI to synthesize programs and their proofs in F*, with promising results.
arXiv Detail & Related papers (2024-05-03T00:14:33Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks.
But their performance in tackling programming problems with complex data structures and algorithms remains suboptimal.
We propose an in-context learning approach that guides LLMs to debug using a "print debugging" method.
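The paper targets Python problems; purely as a hedged sketch, and keeping to this document's subject language, the idea of print debugging is to expose intermediate state so a faulty step becomes visible in the program's output:

```dafny
// Sketch of print debugging in Dafny (the paper itself uses Python).
// The trace line prints loop state on each iteration so a wrong update
// can be spotted by inspecting the output.
method SumOfEvens(a: array<int>) returns (total: int) {
  total := 0;
  var i := 0;
  while i < a.Length {
    if a[i] % 2 == 0 {
      total := total + a[i];
    }
    print "i=", i, " a[i]=", a[i], " total=", total, "\n"; // debug trace
    i := i + 1;
  }
}
```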
arXiv Detail & Related papers (2024-01-10T18:37:59Z) - Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code [7.760653867600283]
We evaluate GPT-4 using three prompt engineering strategies -- basic prompting, in-context learning, and task-specific prompting.
We compare it against 17 fine-tuned models across three code-related tasks: code summarization, generation, and translation.
arXiv Detail & Related papers (2023-10-11T00:21:00Z) - Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification [40.83776920225375]
OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets.
We propose a novel and effective prompting method, explicit code-based self-verification (CSV).
We achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
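CSV has the model write code that re-derives and checks its own answer rather than trusting it. As a loose analogue in this document's subject language (the paper itself uses Python via GPT-4 Code Interpreter; names and numbers here are illustrative):

```dafny
// Loose analogue of code-based self-verification: the candidate answer is
// recomputed by executable code and the run aborts if the check fails.
function SumFirstN(n: nat): nat {
  n * (n + 1) / 2
}

method Main() {
  var candidate := 78;           // hypothetical model-produced answer
  var check := SumFirstN(12);    // independent recomputation: 12*13/2 = 78
  expect candidate == check, "self-verification failed";
  print "verified answer: ", candidate, "\n";
}
```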
arXiv Detail & Related papers (2023-08-15T17:58:45Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Ask Me Anything: A simple strategy for prompting language models [24.294416731247427]
Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt.
We develop an understanding of effective prompt formats, finding that question-answering (QA) prompts tend to outperform those that restrict model outputs.
We apply the collected prompts to obtain several noisy votes for the input's true label.
We find that the prompts can have very different accuracies and complex dependencies.
arXiv Detail & Related papers (2022-10-05T17:59:45Z) - Finetuned Language Models Are Zero-Shot Learners [67.70352207685558]
We show that instruction tuning boosts zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates.
We evaluate this instruction-tuned model, which we call FLAN, on unseen task types.
arXiv Detail & Related papers (2021-09-03T17:55:52Z) - TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)