Related papers: Towards AI-Assisted Synthesis of Verified Dafny Methods

Towards AI-Assisted Synthesis of Verified Dafny Methods

URL: http://arxiv.org/abs/2402.00247v2
Date: Mon, 10 Jun 2024 20:14:41 GMT
Title: Towards AI-Assisted Synthesis of Verified Dafny Methods
Authors: Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, James Noble,
Abstract summary: Existing large language models show a severe lack of proficiency in verified programming. We show how to improve two pretrained models' proficiency in the Dafny verification-aware language.
Score: 1.0187122752343796
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases, and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes retrieval augmentation generated example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks and that both models perform best with the retrieval augmentation generated CoT prompt. GPT-4 was able to generate verified, human-evaluated, Dafny methods for 58% of the problems, however, GPT-4 managed only 19% of the problems with the Contextless prompt, and even fewer (10%) for the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems, 50 that we wrote manually, and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code generating large language models...

Related papers

Type-Constrained Code Generation with Language Models [51.03439021895432]
Large language models (LLMs) produce uncompilable output because their next-token inference procedure does not model formal aspects of code. We introduce a type-constrained decoding approach that leverages type systems to guide code generation. Our approach reduces compilation errors by more than half and increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z)
Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity [0.5370906227996627]
We present the first on synthetic data augmentation for project level proof oriented programming for both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language. We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of the models that outperforms GPT-4o in project-level proof-oriented programming.
arXiv Detail & Related papers (2025-02-17T15:24:11Z)
Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast [5.019791860882564]
Large language models (LLMs) have shown promise in a number of software engineering activities. This paper explores the effectiveness of OpenAI's GPT models in generating specifications based on separation logic.
arXiv Detail & Related papers (2024-11-04T17:44:11Z)
Improving Autoformalization using Type Checking [15.58948808529849]
Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages. The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved using Codex for Lean 3, only showed successful formalization of 16.1% of informal statements. Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check.
arXiv Detail & Related papers (2024-06-11T13:01:50Z)
Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming [8.34623776815378]
We curate a dataset of 600K lines of open-source F* programs and proofs. This dataset includes software used in production systems ranging from Windows and Linux to Python and Firefox. We investigate the use of AI to synthesize programs and their proofs in F*, with promising results.
arXiv Detail & Related papers (2024-05-03T00:14:33Z)
Do Neutral Prompts Produce Insecure Code? FormAI-v2 Dataset: Labelling Vulnerabilities in Code Generated by Large Language Models [3.4887856546295333]
This study provides a comparative analysis of state-of-the-art large language models (LLMs) We analyze how likely they generate vulnerabilities when writing simple C programs using a neutral zero-shot prompt.
arXiv Detail & Related papers (2024-04-29T01:24:14Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks. But their performance in tackling programming problems with complex data structures and algorithms remains suboptimal. We propose an in-context learning approach that guides LLMs to debug by using a "print debug" method.
arXiv Detail & Related papers (2024-01-10T18:37:59Z)
Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code [7.760653867600283]
We evaluate GPT-4 using three prompt engineering strategies -- basic prompting, in-context learning, and task-specific prompting. We compare it against 17 fine-tuned models across three code-related tasks: code summarization, generation, and translation.
arXiv Detail & Related papers (2023-10-11T00:21:00Z)
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification [40.83776920225375]
OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. We propose a novel and effective prompting method, explicit ulinecode-based ulineself-ulineverification(CSV) We achieve an impressive zero-shot accuracy on MATH dataset textbf(53.9% $to$ 84.3%).
arXiv Detail & Related papers (2023-08-15T17:58:45Z)
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation. We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
Finetuned Language Models Are Zero-Shot Learners [67.70352207685558]
We show that instruction tuning boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types.
arXiv Detail & Related papers (2021-09-03T17:55:52Z)
TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.