Can ChatGPT support software verification?
- URL: http://arxiv.org/abs/2311.02433v1
- Date: Sat, 4 Nov 2023 15:25:18 GMT
- Title: Can ChatGPT support software verification?
- Authors: Christian Janßen, Cedric Richter, Heike Wehrheim
- Abstract summary: We ask ChatGPT to annotate 106 C programs with loop invariants.
We check validity and usefulness of the generated invariants by passing them to two verifiers, Frama-C and CPAchecker.
Our evaluation shows that ChatGPT is able to produce valid and useful invariants allowing Frama-C to verify tasks that it could not solve before.
- Score: 0.9668407688201361
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models have become increasingly effective in software
engineering tasks such as code generation, debugging and repair. Language
models like ChatGPT can not only generate code, but also explain its inner
workings and in particular its correctness. This raises the question of whether
we can utilize ChatGPT to support formal software verification.
In this paper, we take some first steps towards answering this question. More
specifically, we investigate whether ChatGPT can generate loop invariants. Loop
invariant generation is a core task in software verification, and the
generation of valid and useful invariants would likely help formal verifiers.
To provide some first evidence on this hypothesis, we ask ChatGPT to annotate
106 C programs with loop invariants. We check validity and usefulness of the
generated invariants by passing them to two verifiers, Frama-C and CPAchecker.
Our evaluation shows that ChatGPT is able to produce valid and useful
invariants allowing Frama-C to verify tasks that it could not solve before.
Based on our initial insights, we propose ways of combining ChatGPT (or large
language models in general) and software verifiers, and discuss current
limitations and open issues.
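
To make the core task concrete, the sketch below shows the kind of annotation involved: a small C loop carrying ACSL loop invariants, the annotation language understood by Frama-C. The program and its invariants are illustrative assumptions of ours, not examples drawn from the paper's 106 benchmark tasks; they merely show what a model-generated invariant must look like for a verifier to consume it.

```c
/* A minimal sketch (our illustration, not a benchmark task from the
 * paper): a C loop annotated with ACSL loop invariants of the kind
 * ChatGPT is asked to generate. Frama-C's WP plugin can use them to
 * prove the assertion after the loop. */
int main(void) {
    int n = 100;
    int i = 0;
    int sum = 0;

    /*@ loop invariant 0 <= i <= n;
      @ loop invariant sum == i * (i - 1) / 2;
      @ loop assigns i, sum;
      @ loop variant n - i;
      @*/
    while (i < n) {
        sum += i;
        i++;
    }

    //@ assert sum == n * (n - 1) / 2;
    return 0;
}
```

Such a file would typically be checked with a run like `frama-c -wp file.c`. Roughly, an invariant is valid if it holds on loop entry and is preserved by every iteration, and useful if it is strong enough to let the verifier discharge the final assertion; an invariant can be valid yet useless (e.g., `i >= 0` alone).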
Related papers
- Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks? [10.389763758883975]
Recent studies proposed utilizing ChatGPT as both a developer and tester for collaborative software development.
We conduct a comprehensive empirical investigation to evaluate ChatGPT's self-verification capability in code generation, code completion, and program repair.
arXiv Detail & Related papers (2024-05-21T09:47:33Z)
- Investigating the Utility of ChatGPT in the Issue Tracking System: An Exploratory Study [5.176434782905268]
This study examines the interactions between ChatGPT and developers to analyze their prevalent activities and how issues are resolved.
Our investigation reveals that developers mainly use ChatGPT for brainstorming solutions but often opt to write their code instead of using ChatGPT-generated code.
arXiv Detail & Related papers (2024-02-06T06:03:05Z)
- Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
- Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency to select labels at earlier positions as the answer.
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z)
- Automatic Code Summarization via ChatGPT: How Far Are We? [10.692654700225411]
We evaluate ChatGPT on a widely-used Python dataset called CSN-Python.
In terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than that of all three SOTA models.
Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
arXiv Detail & Related papers (2023-05-22T09:43:40Z)
- ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired infrastructural status.
Existing benchmarks encounter two challenges: (1) disregard for periodic evaluation and (2) lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March 2023 onward.
arXiv Detail & Related papers (2023-04-27T11:33:48Z)
- Is ChatGPT the Ultimate Programming Assistant -- How far is it? [11.943927095071105]
ChatGPT has received great attention: it can be used as a bot for discussing source code.
We present an empirical study of ChatGPT's potential as a fully automated programming assistant.
arXiv Detail & Related papers (2023-04-24T09:20:13Z)
- When do you need Chain-of-Thought Prompting for ChatGPT? [87.45382888430643]
Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models (LLMs).
It is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT.
arXiv Detail & Related papers (2023-04-06T17:47:29Z)
- ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark [11.36853733574956]
ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI.
We compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR).
We find that ChatGPT does not perform as well as those baselines in terms of automatic evaluation metrics.
arXiv Detail & Related papers (2023-03-15T00:35:50Z)
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability on the popular GLUE benchmark, comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.