Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests
with Differential Prompting
- URL: http://arxiv.org/abs/2304.11686v6
- Date: Sat, 9 Sep 2023 09:55:18 GMT
- Title: Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests
with Differential Prompting
- Authors: Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi
Cheung, Jeff Kramer
- Abstract summary: ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs.
A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version.
We propose a novel approach that combines ChatGPT and differential testing to find failure-inducing test cases.
- Score: 20.914970341922707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically detecting software failures is an important task and a
longstanding challenge. It requires finding failure-inducing test cases whose
test input can trigger the software's fault, and constructing an automated
oracle to detect the software's incorrect behaviors. The recent advancement of
large language models (LLMs) motivates us to study how far this challenge can
be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows
that ChatGPT has a low probability (28.8%) of finding correct failure-inducing
test cases for buggy programs. A possible reason is that finding
failure-inducing test cases requires analyzing the subtle code differences
between a buggy program and its correct version. When these two versions have
similar syntax, ChatGPT is weak at recognizing subtle code differences. Our
insight is that ChatGPT's performance can be substantially enhanced when
ChatGPT is guided to focus on the subtle code difference. We make the
interesting observation that ChatGPT is effective at inferring the intended
behavior of a buggy program. The intended behavior can be leveraged to
synthesize programs, in order to make the subtle code difference between a
buggy program and its correct version (i.e., the synthesized program) explicit.
Driven by this observation, we propose a novel approach that synergistically
combines ChatGPT and differential testing to find failure-inducing test cases.
We evaluate our approach on QuixBugs (a benchmark of buggy programs) and
compare it with state-of-the-art baselines, including the direct use of ChatGPT and
Pynguin. The experimental results show that our approach has a much higher
probability (77.8%) of finding correct failure-inducing test cases, 2.7X that of
the best baseline.
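Based only on the workflow described in the abstract, the following is a minimal Python sketch of the idea: infer the buggy program's intended behavior, synthesize a reference version, and use it as a differential-testing oracle. The `ask_chatgpt` helper, prompts, and function names are hypothetical illustrations, not the authors' implementation.

```python
# A minimal sketch of the differential-testing idea from the abstract; it is
# NOT the authors' implementation. `ask_chatgpt` is a hypothetical stand-in
# for an LLM API call, and all prompts and helpers are illustrative.
import ast


def ask_chatgpt(prompt: str) -> str:
    """Hypothetical LLM call; wire this up to a real client of your choice."""
    raise NotImplementedError


def infer_intention(buggy_source: str) -> str:
    # Step 1: infer the buggy program's intended behavior (an informal spec).
    return ask_chatgpt(f"Describe the intended behavior of this program:\n{buggy_source}")


def synthesize_reference(intention: str) -> str:
    # Step 2: synthesize a candidate correct program from the inferred intention,
    # making the subtle difference from the buggy version explicit.
    return ask_chatgpt(f"Write a Python function that implements:\n{intention}")


def generate_inputs(intention: str) -> list:
    # Step 3: ask for test inputs, one Python literal per line.
    reply = ask_chatgpt(f"Give 10 test inputs, one Python literal per line, for:\n{intention}")
    return [ast.literal_eval(line) for line in reply.splitlines() if line.strip()]


def find_failure_inducing(buggy_fn, reference_fn, inputs):
    # Step 4: differential testing -- the synthesized program acts as the oracle;
    # any input on which the two versions disagree is failure-inducing.
    failures = []
    for x in inputs:
        try:
            if buggy_fn(x) != reference_fn(x):
                failures.append(x)
        except Exception:
            failures.append(x)  # a crash in the buggy version also counts as a failure
    return failures
```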
Related papers
- Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis [6.123324869194196]
We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode.
We show that, first, ChatGPT solves fewer problems as difficulty rises.
Second, prompt engineering improves ChatGPT's performance.
Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket.
arXiv Detail & Related papers (2024-11-12T04:01:09Z) - Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data [49.1574468325115]
ChatGPT is an AI tool that enhances software production efficiency.
We estimate ChatGPT's effects on the number of git pushes, repositories, and unique developers per 100,000 people.
These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low-quality code and privacy concerns.
arXiv Detail & Related papers (2024-06-16T19:11:15Z) - Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair [19.123640635549524]
Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of software engineering tasks.
This paper reviews the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives.
ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming the state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% in prediction accuracy, respectively.
arXiv Detail & Related papers (2023-10-13T06:11:47Z) - Exploring the Potential of ChatGPT in Automated Code Refinement: An
Empirical Study [0.0]
ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks.
We conduct the first empirical study to understand the capabilities of ChatGPT in code review tasks.
Our results show that ChatGPT achieves higher EM and BLEU scores of 22.78 and 76.44, respectively, while the state-of-the-art method achieves only 15.50 and 62.88 on a high-quality code review dataset.
arXiv Detail & Related papers (2023-09-15T07:41:33Z) - Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures [0.6990493129893112]
We evaluate ChatGPT's ability to generate correct solutions to the problems fed to it, its code quality, and nature of run-time errors thrown by its code.
We examine patterns in the test cases that pass to gain insight into how far off ChatGPT's code is in these situations.
arXiv Detail & Related papers (2023-07-10T08:20:34Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z) - Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)