Can LLMs Enable Verification in Mainstream Programming?
- URL: http://arxiv.org/abs/2503.14183v1
- Date: Tue, 18 Mar 2025 11:58:00 GMT
- Title: Can LLMs Enable Verification in Mainstream Programming?
- Authors: Aleksandr Shefer, Igor Engel, Stanislav Alekseev, Daniil Berezun, Ekaterina Verbitskaia, Anton Podkopaev
- Abstract summary: We explore the ability of LLMs to produce verified code in three verification languages. To do so, we use manually curated datasets derived from the state-of-the-art Python benchmark, HumanEval.
- Score: 37.69303106863453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-of-the-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.
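To make concrete what "verified code" means here, below is a minimal sketch of a HumanEval-style function annotated for Nagini, one of the three verification languages studied. The function and its contracts are illustrative rather than taken from the paper's datasets; the `nagini_contracts` API (`Requires`, `Ensures`, `Result`) is assumed from Nagini's documentation.

```python
# A minimal sketch of a Nagini-annotated function; the nagini_contracts
# API (Requires/Ensures/Result) is assumed from Nagini's documentation,
# and the property shown is illustrative, not from the paper's datasets.
from nagini_contracts.contracts import Ensures, Requires, Result


def clamp(x: int, lo: int, hi: int) -> int:
    Requires(lo <= hi)                                  # precondition
    Ensures(lo <= Result() and Result() <= hi)          # result stays in range
    Ensures(Result() == x or Result() == lo or Result() == hi)
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
```

Unlike unit tests, which sample the input space, a verifier such as Nagini checks these contracts statically for all inputs.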
Related papers
- CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings [32.72039589832989]
Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency.
These advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards.
We propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains.
arXiv Detail & Related papers (2025-03-17T21:41:37Z)
- Dafny as Verification-Aware Intermediate Language for Code Generation [0.0]
Large language models (LLMs) generate source code from natural language prompts. One limitation is that the generated code can be faulty, despite being presented to the user as correct. We propose that the user guide the LLM to first generate an opaque intermediate representation in the verification-aware language Dafny. The correct Dafny program is then compiled to the target language and returned to the user.
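This pipeline amounts to a small generate-verify-compile loop. The sketch below is an illustration under stated assumptions, not the paper's implementation: `generate_dafny` is a hypothetical stand-in for an LLM call, while the `dafny verify` and `dafny build --target:<lang>` commands are assumed from the Dafny 4.x CLI.

```python
# Hypothetical sketch of the generate -> verify -> compile loop described
# above. generate_dafny() is a stand-in for an LLM call; `dafny verify`
# and `dafny build` are assumed from the Dafny 4.x CLI.
import subprocess
import tempfile


def generate_dafny(prompt: str, feedback: str = "") -> str:
    """Stand-in for an LLM call that returns Dafny source text."""
    raise NotImplementedError


def verified_build(prompt: str, target: str = "py", max_tries: int = 3) -> str:
    feedback = ""
    for _ in range(max_tries):
        source = generate_dafny(prompt, feedback)
        with tempfile.NamedTemporaryFile(mode="w", suffix=".dfy",
                                         delete=False) as f:
            f.write(source)
            path = f.name
        check = subprocess.run(["dafny", "verify", path],
                               capture_output=True, text=True)
        if check.returncode == 0:
            # Verified: translate to the target language for the user.
            subprocess.run(["dafny", "build", f"--target:{target}", path],
                           check=True)
            return source
        # Feed the verifier's error messages into the next attempt.
        feedback = check.stdout + check.stderr
    raise RuntimeError("no verified program within the retry budget")
```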
arXiv Detail & Related papers (2025-01-10T17:23:14Z)
- Effective LLM-Driven Code Generation with Pythoness [0.0]
Pythoness is an embedded domain-specific language for code generation using large language models (LLMs). In Pythoness, developers operate at the level of behavioral specifications when writing functions, classes, or an entire program. We show that Pythoness can successfully leverage a combination of tests and code generation to yield higher-quality code than specifications alone.
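As a rough picture of programming at the level of behavioral specifications, consider the hypothetical sketch below; the decorator, the `synthesize` helper, and the test-gating loop are all invented for illustration and are not Pythoness's actual API.

```python
# Hypothetical illustration of spec-plus-tests-driven generation; none of
# these names come from the Pythoness paper.
from typing import Callable


def synthesize(spec: str) -> str:
    """Stand-in for an LLM call returning source for the specified function."""
    raise NotImplementedError


def behavioral_spec(spec: str, tests: list, max_tries: int = 5):
    """Regenerate candidates from `spec` until every test passes."""
    def decorator(stub: Callable) -> Callable:
        for _ in range(max_tries):
            ns: dict = {}
            exec(synthesize(spec), ns)                  # compile a candidate
            candidate = ns[stub.__name__]
            if all(test(candidate) for test in tests):  # tests gate acceptance
                return candidate
        raise RuntimeError("no candidate passed the tests")
    return decorator


# Usage (hypothetical):
# @behavioral_spec(
#     "Return the unique elements of xs, preserving first-occurrence order.",
#     tests=[lambda f: f([3, 1, 3, 2]) == [3, 1, 2]],
# )
# def dedupe(xs: list) -> list: ...
```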
arXiv Detail & Related papers (2025-01-03T23:14:46Z) - From Effectiveness to Efficiency: Uncovering Linguistic Bias in Large Language Model-based Code Generation [30.914387085368734]
Large Language Models (LLMs) have demonstrated promising capabilities for code generation.
In this paper, we investigate potential linguistic bias through the lens of English and Chinese.
arXiv Detail & Related papers (2024-06-02T03:22:30Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow, bridging the gap between programming languages and natural language, as sketched below.
Experiments and ablations on four datasets, covering both C++ and Python, validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
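As a toy stand-in for the graphical view CodeGRAG builds, the sketch below extracts crude data-flow edges (definition line to use line) from Python source with the standard `ast` module; the paper's actual graphs also encode control flow and feed a pretrained GNN expert.

```python
# Crude data-flow edge extraction with the standard ast module: a toy
# stand-in for the richer control/data-flow graphs CodeGRAG builds.
# Walk order only approximates textual order, so this is best-effort
# for straight-line code.
import ast


def dataflow_edges(source: str) -> set[tuple[int, int]]:
    """Return (def_line, use_line) edges between statements."""
    last_def: dict[str, int] = {}  # variable -> line that last assigned it
    edges: set[tuple[int, int]] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                last_def[node.id] = node.lineno
            elif isinstance(node.ctx, ast.Load) and node.id in last_def:
                edges.add((last_def[node.id], node.lineno))
    return edges


print(dataflow_edges("x = 1\ny = x + 1\nprint(y)"))  # {(1, 2), (2, 3)}
```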
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
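Generically, selection-based in-context learning for code generation looks like the sketch below; `score` is a hypothetical stand-in for LAIL's learned, LLM-aware estimator, and the prompt format is invented for illustration.

```python
# Generic sketch of scoring-based in-context example selection. score()
# is a hypothetical stand-in for LAIL's learned, LLM-aware estimator,
# and the prompt format is invented for illustration.
def score(query: str, example: tuple[str, str]) -> float:
    """Stand-in for a learned estimate of how helpful `example` is."""
    raise NotImplementedError


def build_prompt(query: str, pool: list[tuple[str, str]], k: int = 4) -> str:
    # Keep the k highest-scoring (description, code) pairs as demonstrations.
    chosen = sorted(pool, key=lambda ex: score(query, ex), reverse=True)[:k]
    shots = "\n\n".join(f"# Task: {nl}\n{code}" for nl, code in chosen)
    return f"{shots}\n\n# Task: {query}\n"
```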
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
- CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs.
We evaluate 12 widely used LLMs, including both general-purpose and specialized models.
GPT-4 exhibits the best programming capabilities, achieving approximate accuracies of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z)
- How Effective are Large Language Models in Generating Software Specifications? [14.170320751508502]
Large Language Models (LLMs) have been successfully applied to numerous Software Engineering (SE) tasks. We conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation.
arXiv Detail & Related papers (2023-06-06T00:28:39Z)
- Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
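The reranking step can be pictured as follows; `execute` (a sandboxed runner) and `verifier_prob` (the trained verifier) are stand-in names, and the sketch omits LEVER's aggregation of scores across samples whose executions agree.

```python
# Sketch of LEVER-style reranking: score each sampled program on the
# natural-language input, the program text, and its execution result.
# execute() and verifier_prob() are stand-in names, not the paper's API.
def execute(program: str) -> str:
    """Stand-in for a sandboxed executor returning a printable result."""
    raise NotImplementedError


def verifier_prob(nl: str, program: str, result: str) -> float:
    """Stand-in for the trained verifier's P(correct | nl, program, result)."""
    raise NotImplementedError


def pick_program(nl: str, samples: list[str]) -> str:
    scored = [(verifier_prob(nl, p, execute(p)), p) for p in samples]
    # Return the sample the verifier trusts most.
    return max(scored, key=lambda pair: pair[0])[1]
```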
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
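Cloze-style probing of this kind can be illustrated with the HuggingFace transformers fill-mask pipeline (an assumption of this sketch); X-FACTR's actual probes additionally handle multi-token entities across the 23 languages.

```python
# Minimal illustration of cloze-style factual probing, assuming the
# HuggingFace transformers fill-mask pipeline and multilingual BERT.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")

# The model's top completions indicate whether the fact is stored in
# its parameters.
for pred in unmasker("Paris is the capital of [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```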
arXiv Detail & Related papers (2020-10-13T05:29:56Z)