Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
- URL: http://arxiv.org/abs/2507.13290v1
- Date: Thu, 17 Jul 2025 16:54:42 GMT
- Title: Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
- Authors: Aaron Councilman, David Fu, Aryan Gupta, Chengxiao Wang, David Grove, Yu-Xiong Wang, Vikram Adve
- Abstract summary: We seek to offer formal guarantees of correctness to LLM-generated code. We propose to incorporate a formal query language that can represent a user's intent in a formally defined but natural-language-like manner. Our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.
- Score: 17.130884318613944
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In the past few years, LLMs have emerged as a tool that can aid programmers by taking natural language descriptions and generating code from them. However, LLMs often generate incorrect code that users need to fix, and the literature suggests users often struggle to detect these errors. In this work we seek to offer formal guarantees of correctness for LLM-generated code; such guarantees could improve the experience of using AI code assistants and potentially enable natural language programming for users with little or no programming knowledge. To address this challenge, we propose to incorporate a formal query language that can represent a user's intent in a formally defined but natural-language-like manner that the user can confirm matches their intent. Using such a query, we then propose to verify LLM-generated code to ensure it matches the user's intent. We implement these ideas in our system, Astrogator, for the Ansible programming language; it includes such a formal query language, a calculus for representing the behavior of Ansible programs, and a symbolic interpreter used for verification. On a benchmark suite of 21 code-generation tasks, our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.
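The following is a minimal Python sketch of the verify-against-intent idea from the abstract: a formal query stands for the user's confirmed intent, and a drastically simplified "symbolic interpreter" computes the state an Ansible task would establish so the two can be compared. The query format, `interpret_task`, and the single modeled module are hypothetical stand-ins, not Astrogator's actual query language or calculus.

```python
# Illustrative sketch only: a formal query captures the user's intent as a
# checkable predicate over the final system state the generated task would
# produce. All names and structures here are invented simplifications.
QUERY = {"path": "/etc/motd", "state": "present", "content": "Welcome!"}

def interpret_task(task: dict) -> dict:
    """Abstractly 'execute' one Ansible task, returning the state it
    establishes. A real symbolic interpreter would model the effects of
    every module; this sketch only models ansible.builtin.copy."""
    module = task.get("ansible.builtin.copy")
    if module is None:
        raise NotImplementedError("only the copy module is modeled here")
    return {"path": module["dest"], "state": "present", "content": module["content"]}

def verify(task: dict, query: dict) -> bool:
    """Report whether the task's symbolic final state satisfies the query."""
    return interpret_task(task) == query

llm_generated_task = {
    "name": "Write the message of the day",
    "ansible.builtin.copy": {"dest": "/etc/motd", "content": "Welcome!"},
}
print(verify(llm_generated_task, QUERY))  # True: code matches the stated intent
```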
Related papers
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
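To make the idea of instruction-constrained code evaluation concrete, here is a small hedged sketch: the two constraints and the AST-based checker are invented for illustration and are far simpler than the benchmark's real multilingual constraints.

```python
# Hedged sketch: check whether generated code follows explicit instruction
# constraints (a toy version of what IFEvalCode-style evaluation measures).
import ast

def follows_constraints(source: str, required_name: str, forbid_loops: bool) -> bool:
    """Two example constraints on a Python solution: the entry point must be
    named `required_name`, and for/while loops may be forbidden (e.g. when
    the instruction asks for a recursive solution)."""
    tree = ast.parse(source)
    names = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    has_loop = any(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))
    return required_name in names and not (forbid_loops and has_loop)

candidate = "def fact(n):\n    return 1 if n <= 1 else n * fact(n - 1)\n"
print(follows_constraints(candidate, "fact", forbid_loops=True))  # True
```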
- Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z)
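A minimal sketch of the constrained-decoding loop this paper describes: at each step, tokens that cannot extend to a well-typed program are masked before the model's choice is taken. The toy vocabulary, scores, and `prefix_is_viable` check are stand-ins for the paper's prefix automata and type-system machinery.

```python
# Hedged sketch of type-constrained decoding over a toy int-expression
# language: the ill-typed token is masked out at every step.
VOCAB = ["x", "+", "1", '"a"']

def prefix_is_viable(tokens: list[str]) -> bool:
    """Toy viability check for `int r = <expr>;`: operands and operators
    must alternate, and a string literal can never type-check."""
    for i, tok in enumerate(tokens):
        if tok == '"a"':
            return False  # a string can never appear in an int expression
        if (tok == "+") == (i % 2 == 0):
            return False  # operator/operand in the wrong position
    return True

def decode(score, steps: int) -> list[str]:
    tokens: list[str] = []
    for _ in range(steps):
        # Mask tokens that cannot extend to a well-typed program, then
        # greedily pick the highest-scoring survivor.
        viable = [t for t in VOCAB if prefix_is_viable(tokens + [t])]
        tokens.append(max(viable, key=score))
    return tokens

# A model that (wrongly) prefers the ill-typed string literal:
score = {"x": 0.2, "+": 0.3, "1": 0.1, '"a"': 0.9}.get
print(decode(score, 3))  # ['x', '+', 'x'] — '"a"' is never emitted
```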
- How Accurately Do Large Language Models Understand Code? [4.817546726074033]
Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing. Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric. This paper presents the first large-scale empirical investigation into LLMs' ability to understand code.
arXiv Detail & Related papers (2025-04-06T05:59:29Z)
- HoarePrompt: Structural Reasoning About Program Correctness in Natural Language [6.0749049701897295]
HoarePrompt is a novel approach that adapts fundamental ideas from program analysis and verification to natural language artifacts. To manage loops, we propose few-shot-driven k-induction, an adaptation of the k-induction method widely used in model checking. Our experiments show that HoarePrompt improves the MCC by 62% compared to directly using Zero-shot-CoT prompts for correctness classification.
arXiv Detail & Related papers (2025-03-25T12:30:30Z)
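For readers unfamiliar with k-induction, here is a brute-force Python rendering of the schema the paper adapts: a property that holds on the first k loop states, and is preserved whenever it held on k consecutive states, holds on every iteration. The loop, invariant, and sampling are invented examples; HoarePrompt drives the same schema with few-shot LLM reasoning rather than enumeration.

```python
# Hedged sketch of k-induction for a loop invariant, checked by brute force.
def k_induction_holds(inv, step, k: int, x0: int, samples=range(-100, 100)) -> bool:
    # Base cases: the invariant holds on the first k states of the loop.
    state = x0
    for _ in range(k):
        if not inv(state):
            return False
        state = step(state)
    # Inductive step, checked empirically over sample states here: whenever
    # the invariant holds on k consecutive states, it holds on the next one.
    for s in samples:
        window = [s]
        for _ in range(k - 1):
            window.append(step(window[-1]))
        if all(inv(w) for w in window) and not inv(step(window[-1])):
            return False
    return True

# Loop `while True: x += 2` started at x0 = 0; invariant: x stays even.
print(k_induction_holds(lambda x: x % 2 == 0, lambda x: x + 2, k=1, x0=0))  # True
```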
- Dafny as Verification-Aware Intermediate Language for Code Generation [0.0]
Large language models (LLMs) generate source code from natural language prompts. One limitation is that the generated code can be faulty at times, despite being presented to the user as correct. We propose that the user guide the LLM to first generate an opaque intermediate representation in the verification-aware language Dafny. The correct Dafny program is then compiled to the target language and returned to the user.
arXiv Detail & Related papers (2025-01-10T17:23:14Z)
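A hedged sketch of the proposed pipeline shape: NL prompt, Dafny program, verifier gate, then compilation to the target language. `generate_dafny` is a placeholder for the LLM call, and the CLI invocations assume a Dafny 4-style command line (`dafny verify`, `dafny translate`); adjust for your installation.

```python
# Hedged pipeline sketch: only programs that pass the Dafny verifier are
# compiled and returned to the user.
import pathlib
import subprocess
import tempfile

def generate_dafny(nl_prompt: str) -> str:
    """Placeholder for the LLM call that emits a Dafny program with specs."""
    return (
        "method Abs(x: int) returns (y: int)\n"
        "  ensures y >= 0 && (y == x || y == -x)\n"
        "{ if x < 0 { y := -x; } else { y := x; } }\n"
    )

def pipeline(nl_prompt: str, target: str = "py"):
    src = pathlib.Path(tempfile.mkdtemp()) / "prog.dfy"
    src.write_text(generate_dafny(nl_prompt))
    # Gate on the verifier: unverified code never reaches the user. In the
    # paper's loop, verifier feedback would instead go back to the LLM.
    if subprocess.run(["dafny", "verify", str(src)]).returncode != 0:
        return None
    subprocess.run(["dafny", "translate", target, str(src)], check=True)
    return src

print(pipeline("compute the absolute value of an integer"))
```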
- Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce synthetic programming elicitation and compilation (SPEAC).
SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z)
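For context, here is the baseline generate-parse-repair loop that low-resource text-to-code work typically starts from; `llm` and `parse_uclid5` are placeholders, and SPEAC's actual method elicits programs via an intermediate language plus deterministic compilation rather than this plain retry loop.

```python
# Hedged sketch: draft with an LLM, check with the target language's parser,
# and feed parse errors back for repair. All names here are invented stubs.
def parse_uclid5(program: str):
    """Placeholder parser: return None on success, or an error message."""
    return None if program.strip().startswith("module") else "expected 'module'"

def llm(prompt: str) -> str:
    """Placeholder LLM: pretend the first repair attempt succeeds."""
    return "module main { }" if "error" in prompt else "main { }"

def generate_with_repair(task: str, max_rounds: int = 3):
    prompt = task
    for _ in range(max_rounds):
        draft = llm(prompt)
        err = parse_uclid5(draft)
        if err is None:
            return draft  # syntactically valid; semantics still to be checked
        prompt = f"{task}\nThe previous draft had a parse error: {err}. Fix it."
    return None

print(generate_with_repair("model a simple counter in UCLID5"))
```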
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting yields a substantial performance boost for multiple LLMs.
Our analysis of GPT-3.5 reveals that the code formatting of the input problem is essential for the performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
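A small illustration of the transformation the paper describes. The NL problem and its code form below are invented; note that the paper feeds such code to the LLM as a prompt to elicit reasoning, whereas here we simply run it to show how the code form makes the condition structure explicit.

```python
# NL problem: "You can vote if you are at least 18 and a citizen.
#              Alice is 17 and a citizen. Can Alice vote?"

def can_vote(age: int, citizen: bool) -> bool:
    return age >= 18 and citizen  # each NL condition becomes one predicate

alice = {"age": 17, "citizen": True}
print(can_vote(**alice))  # False — the age condition fails
```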
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions? [17.03841665553565]
Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to specifications that match programmer intent.
In this paper we describe nl2postcond, the problem of leveraging LLMs to translate informal natural language intent into formal method postconditions, expressed as program assertions.
arXiv Detail & Related papers (2023-10-03T06:55:45Z)
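A hedged example of the kind of artifact nl2postcond studies: a postcondition for the intent "return the largest element of a non-empty list", expressed as a program assertion. The function and assertion text are invented for illustration.

```python
# Postcondition-as-assertion sketch: the assertion both documents the NL
# intent and checks it at runtime.
def largest(xs: list[int]) -> int:
    result = xs[0]
    for x in xs[1:]:
        if x > result:
            result = x
    # Candidate postcondition: the result is an element of xs and bounds them all.
    assert result in xs and all(result >= x for x in xs)
    return result

print(largest([3, 1, 4, 1, 5]))  # 5; the assertion formalizes the intent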
- Fixing Large Language Models' Specification Misunderstanding for Better Code Generation [13.494822086550604]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs). It first exploits test case analysis to obtain specification understanding and enables a self-improvement process. muFiX then fixes the specification understanding in a direction that reduces the gap between the provided understanding and the actual understanding.
arXiv Detail & Related papers (2023-09-28T02:58:07Z)
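A hedged sketch of a muFiX-style prompt flow: first elicit the model's specification understanding via test-case analysis, then feed any mismatch back before asking for code. `ask_llm` is a stub, and the staging mirrors the paper's idea rather than its exact prompts.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder model call; here it returns a deliberately wrong prediction."""
    return "0"

def mufix_style_prompt(spec: str, test_input: str, expected: str) -> str:
    # Stage 1: test-case analysis exposes the model's current understanding.
    predicted = ask_llm(f"Spec: {spec}\nWhat is the output for {test_input}?")
    fix = ""
    # Stage 2: if the prediction is wrong, add feedback that fixes the
    # specification understanding before requesting code.
    if predicted.strip() != expected:
        fix = (f"Your predicted output {predicted!r} for {test_input} is wrong; "
               f"the expected output is {expected!r}. Revise your understanding. ")
    return f"Spec: {spec}\n{fix}Now write the function."

print(mufix_style_prompt("return n squared", "n=3", "9"))
```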
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
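A minimal sketch of LEVER-style reranking: sample programs, execute them, and combine the generator's likelihood with a verifier's correctness probability. The tiny `verifier_prob` below is a hand-written stub; LEVER trains a real verifier on (input, program, execution result) triples.

```python
# Hedged reranking sketch: executable, plausible candidates outrank
# higher-likelihood candidates that crash.
def execute(program: str):
    try:
        env: dict = {}
        exec(program, env)        # run the candidate program
        return env.get("answer")  # convention here: result bound to `answer`
    except Exception:
        return None

def verifier_prob(nl: str, program: str, result) -> float:
    """Stub verifier: crashing programs are almost surely wrong."""
    return 0.05 if result is None else 0.8

def rerank(nl: str, samples: list):
    # Score each sample by generator likelihood x verifier probability.
    scored = [(p * verifier_prob(nl, prog, execute(prog)), prog)
              for prog, p in samples]
    return max(scored)[1]

samples = [("answer = 2 + 2 * 3", 0.6), ("answer = (2 + 2) * 3 oops", 0.7)]
print(rerank("what is 2 plus 2 times 3?", samples))  # the executable candidate wins
```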
- Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z)
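A hedged sketch of the TiCoder-style interaction: ambiguous candidate implementations are distinguished by asking the user to accept or reject a concrete test, and candidates that disagree with the user's answer are pruned. The candidates, test input, and `ask_user` stub are invented for illustration.

```python
def cand_a(x: int) -> int:  # truncating division
    return x // 2

def cand_b(x: int) -> int:  # rounding division
    return round(x / 2)

def ask_user(question: str) -> bool:
    print(question)
    return True  # stub: pretend the user confirms (the real system asks them)

def prune(candidates, test_input: int):
    outputs = {f(test_input) for f in candidates}
    if len(outputs) == 1:
        return candidates  # test does not discriminate; try another input
    expected = candidates[0](test_input)
    # The user's yes/no answer formalizes their intent as a test case.
    ok = ask_user(f"Should f({test_input}) == {expected}?")
    return [g for g in candidates if (g(test_input) == expected) == ok]

print(len(prune([cand_a, cand_b], 3)))  # 1: the answer resolves the ambiguity
```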
This list is automatically generated from the titles and abstracts of the papers on this site.