Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast
- URL: http://arxiv.org/abs/2411.02318v2
- Date: Tue, 05 Nov 2024 07:13:13 GMT
- Title: Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast
- Authors: Marilyn Rego, Wen Fan, Xin Hu, Sanya Dod, Zhaorui Ni, Danning Xie, Jenna DiVincenzo, Lin Tan,
- Abstract summary: Large language models (LLMs) have shown promise in a number of software engineering activities.
This paper explores the effectiveness of OpenAI's GPT models in generating specifications based on separation logic.
- Score: 5.019791860882564
- License:
- Abstract: Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can perform specification generation for specifications based in an ownership logic, such as separation logic. To address this gap, this paper explores the effectiveness of large language models (LLMs), specifically OpenAI's GPT models, in generating fully correct specifications based on separation logic for static verification of human-written programs in VeriFast. Our first experiment employed traditional prompt engineering and the second used Chain-of-Thought (CoT) Prompting to identify and address common errors generated across the GPT models. The results indicate that GPT models can successfully generate specifications for verifying heap manipulating code with VeriFast. Furthermore, while CoT prompting significantly reduces syntax errors generated by the GPT models, it does not greatly improve verification error rates compared to prompt engineering.
Related papers
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Insights from Benchmarking Frontier Language Models on Web App Code Generation [1.7268889851975326]
This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark.
The results reveal that while all models possess similar underlying knowledge, their performance is differentiated by the frequency of mistakes they make.
arXiv Detail & Related papers (2024-09-08T18:24:26Z) - Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs)
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation.
We demonstrate that GenRM outperforms discriminative, DPO verifiers, and LLM-as-a-Judge.
arXiv Detail & Related papers (2024-08-27T17:57:45Z) - Understanding Defects in Generated Codes by Language Models [0.669087470775851]
This study categorizes and analyzes 367 identified defects from code snippets generated by Large Language Models.
Error categories indicate key areas where LLMs frequently fail, underscoring the need for targeted improvements.
This paper implemented five prompt engineering techniques, including Scratchpad Prompting, Program of Thoughts Prompting, Chain-of-Thought Prompting, Chain-of-Thought Prompting, and Structured Chain-of-Thought Prompting.
arXiv Detail & Related papers (2024-08-23T21:10:09Z) - Leveraging Large Language Models for Automated Proof Synthesis in Rust [6.202137610101939]
Large Language Models (LLMs) have shown success in code analysis and synthesis.
We present a combination of LLMs and static analysis to synthesize invariants, assertions, and other proof structures for a Rust-based formal verification framework called Verus.
Our prototype decomposes the verification task into multiple smaller ones, iteratively queries GPT-4, and combines its output with lightweight static analysis.
arXiv Detail & Related papers (2023-11-07T05:47:47Z) - Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating texts with the knowledge internalized in parameters.
We propose to verify the output and the knowledge of the knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
arXiv Detail & Related papers (2023-10-19T15:40:00Z) - Test-Case-Driven Programming Understanding in Large Language Models for
Better Code Generation [15.166827643436346]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs)
It first exploits test case analysis to obtain specification understanding and enables a self-improvement process.
muFiX further fixes the specification understanding towards the direction reducing the gap between the provided understanding and the actual understanding.
arXiv Detail & Related papers (2023-09-28T02:58:07Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z) - CodeRL: Mastering Code Generation through Pretrained Models and Deep
Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.