Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast
- URL: http://arxiv.org/abs/2411.02318v3
- Date: Fri, 03 Jan 2025 02:19:03 GMT
- Title: Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast
- Authors: Wen Fan, Marilyn Rego, Xin Hu, Sanya Dod, Zhaorui Ni, Danning Xie, Jenna DiVincenzo, Lin Tan
- Abstract summary: This paper explores OpenAI's GPT-4o model's effectiveness in generating specifications for C programs that are verifiable with VeriFast.
Our results indicate that the specifications generated by GPT-4o preserve functional behavior, but struggle to be verifiable.
- Abstract: Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap-manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can generate specifications based on an ownership logic, such as separation logic. To address this gap, this paper explores OpenAI's GPT-4o model's effectiveness in generating specifications for C programs that are verifiable with VeriFast, a separation-logic-based static verifier. Our experiment employs three different types of user inputs as well as basic and Chain-of-Thought (CoT) prompting to assess GPT's capabilities. Our results indicate that the specifications generated by GPT-4o preserve functional behavior but struggle to be verifiable; when they are verifiable, they contain redundancies. Future directions for improving performance are discussed.
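To make the target concrete, the sketch below shows the kind of separation-logic specification VeriFast accepts for a C function, i.e., the style of annotation GPT-4o is asked to produce. The linked-list predicate and the prepend function are illustrative assumptions for exposition, not drawn from the paper's benchmark programs.

```c
#include <stdlib.h>

struct node {
    struct node *next;
    int value;
};

/*@
// Hypothetical separation-logic predicate: a null-terminated list of `count` nodes.
predicate nodes(struct node *n; int count) =
    n == 0 ?
        count == 0
    :
        n->next |-> ?next &*& n->value |-> ?v &*&
        malloc_block_node(n) &*&
        nodes(next, ?rest) &*& count == rest + 1;
@*/

// Pushes a new node onto the front of the list.
struct node *prepend(struct node *head, int value)
//@ requires nodes(head, ?n);
//@ ensures nodes(result, n + 1);
{
    struct node *fresh = malloc(sizeof(struct node));
    if (fresh == 0) abort();
    fresh->next = head;
    fresh->value = value;
    //@ close nodes(fresh, n + 1);  // fold the new node plus the old list into the predicate
    return fresh;
}
```

VeriFast symbolically executes the body against the requires/ensures clauses; the close ghost statement folds the freshly allocated node's field chunks together with the existing list into the nodes predicate so the postcondition can be discharged.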
Related papers
- A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability [0.8287206589886879]
We propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach to evaluate large language models' (LLMs') test case generation capability.
By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex.
Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations.
arXiv Detail & Related papers (2025-02-05T03:51:44Z)
- Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending and suggesting idiomatic actions.
Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
- Automated Proof Generation for Rust Code via Self-Evolution [69.25795662658356]
We introduce SAFE, a novel framework that overcomes the lack of human-written proofs to enable automated proof generation for Rust code.
We demonstrate superior efficiency and precision compared to GPT-4o.
This advancement leads to a significant improvement in performance, achieving a 70.50% accuracy rate in a benchmark crafted by human experts.
arXiv Detail & Related papers (2024-10-21T08:15:45Z)
- AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation [6.3585378855805725]
We present a novel benchmark to evaluate Large Language Models' effectiveness for assertion generation.
AssertionBench contains 100 curated Verilog hardware designs from OpenCores and formally verified assertions for each design generated from GoldMine and HARM.
arXiv Detail & Related papers (2024-06-26T14:47:28Z)
- Enchanting Program Specification Synthesis by Large Language Models using Static Analysis and Program Verification [15.686651364655958]
AutoSpec is an automated approach to synthesize specifications for automated program verification.
It overcomes the shortcomings of existing work in specification versatility by synthesizing satisfiable and adequate specifications for full proofs.
It can be successfully applied to verify the programs in a real-world X509-parser project.
arXiv Detail & Related papers (2024-03-31T18:15:49Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- Leveraging Large Language Models for Automated Proof Synthesis in Rust [6.202137610101939]
Large Language Models (LLMs) have shown success in code analysis and synthesis.
We present a combination of LLMs and static analysis to synthesize invariants, assertions, and other proof structures for a Rust-based formal verification framework called Verus.
Our prototype decomposes the verification task into multiple smaller ones, iteratively queries GPT-4, and combines its output with lightweight static analysis.
arXiv Detail & Related papers (2023-11-07T05:47:47Z)
- Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating text from the knowledge internalized in their parameters.
We propose to verify the output and the knowledge of the knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
arXiv Detail & Related papers (2023-10-19T15:40:00Z)
- How Effective are Large Language Models in Generating Software Specifications? [14.170320751508502]
Large Language Models (LLMs) have been successfully applied to numerous Software Engineering (SE) tasks.
We conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation.
arXiv Detail & Related papers (2023-06-06T00:28:39Z)
- Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.