Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models
- URL: http://arxiv.org/abs/2310.06320v1
- Date: Tue, 10 Oct 2023 05:30:12 GMT
- Title: Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models
- Authors: Laura Plein, Wendkûuni C. Ouédraogo, Jacques Klein, Tegawendé F. Bissyandé
- Abstract summary: Existing approaches produce test cases that are either simple (e.g., unit tests) or require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
- Score: 4.318319522015101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software testing is a core discipline in software engineering where a large
array of research results has been produced, notably in the area of automatic
test generation. Because existing approaches produce test cases that are
either simple (e.g., unit tests) or require precise specifications, most
testing procedures still rely on test cases written by humans to form test
suites. Such test suites, however, are incomplete: they
only cover parts of the project or they are produced after the bug is fixed.
Yet, several research challenges, such as automatic program repair, and
practitioner processes, build on the assumption that available test suites are
sufficient. There is thus a need to break existing barriers in automatic test
case generation. While prior work largely focused on random unit testing
inputs, we propose to consider generating test cases that realistically
represent complex user execution scenarios, which reveal buggy behaviour. Such
scenarios are informally described in bug reports, which should therefore be
considered as natural inputs for specifying bug-triggering test cases. In this
work, we investigate the feasibility of performing this generation by
leveraging large language models (LLMs) and using bug reports as inputs. Our
experiments include the use of ChatGPT, as an online service, as well as
CodeGPT, a code-related pre-trained LLM that was fine-tuned for our task.
Overall, we experimentally show that bug reports associated with up to 50% of
Defects4J bugs can prompt ChatGPT to generate an executable test case. We show
that even new bug reports can indeed be used as input for generating executable
test cases. Finally, we report experimental results which confirm that
LLM-generated test cases are immediately useful in software engineering tasks
such as fault localization as well as patch validation in automated program
repair.
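The feasibility study the abstract describes boils down to a simple prompting pipeline: feed the textual bug report to an LLM and ask it for a compilable JUnit test that reproduces the reported failure, then check whether the result executes. The sketch below is illustrative only and assumes the OpenAI Python client (openai>=1.0); the model name, prompt wording, and example report are placeholders, not the authors' exact setup.

```python
# Illustrative sketch (not the paper's exact pipeline): prompt an LLM with a
# bug report and ask for a JUnit test that reproduces the reported failure.
# Assumptions: OpenAI Python client (openai>=1.0); model name, prompt wording,
# and the example bug report are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_test_from_bug_report(bug_report: str, class_under_test: str) -> str:
    """Ask the model for a single, compilable JUnit test method that triggers the bug."""
    prompt = (
        f"The following bug report concerns the Java class {class_under_test}.\n\n"
        f"{bug_report}\n\n"
        "Write one JUnit 4 test method that reproduces this bug. "
        "Return only compilable Java code, including the necessary imports."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Illustrative, paraphrased bug-report text (not an exact Defects4J report).
    report = (
        'NumberUtils.createNumber throws NumberFormatException for the input '
        '"0x80000000" instead of returning a Long.'
    )
    print(generate_test_from_bug_report(report, "org.apache.commons.lang3.math.NumberUtils"))
```

A test obtained this way would still have to be compiled and run against the buggy program version (e.g., a Defects4J checkout) to confirm that it is executable and actually fails, which is the criterion behind the 50% figure reported above; such failing tests can then feed fault localization or patch validation.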
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z) - Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z) - Improving LLM-based Unit test generation via Template-based Repair [8.22619177301814]
Unit testing is crucial for detecting bugs in individual program units but consumes time and effort.
Large language models (LLMs) have demonstrated remarkable reasoning and generation capabilities.
In this paper, we propose TestART, a novel unit test generation method.
arXiv Detail & Related papers (2024-08-06T10:52:41Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - Enriching Automatic Test Case Generation by Extracting Relevant Test
Inputs from Bug Reports [8.85274953789614]
We propose a technique for exploring bug reports to identify input values that can be fed to automatic test generation tools.
For Defects4J projects, our study has shown that the technique successfully extracted 68.68% of relevant inputs when using regular expressions.
arXiv Detail & Related papers (2023-12-22T18:19:33Z) - Effective Test Generation Using Pre-trained Large Language Models and
Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z) - Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Detection of Coincidentally Correct Test Cases through Random Forests [1.2891210250935143]
We propose a hybrid approach that combines ensemble learning with a supervised learning algorithm, namely Random Forests (RF), to correctly identify test cases that are mislabeled as passing.
A cost-effectiveness analysis of flipping the test status or trimming (i.e., eliminating from the computation) the coincidentally correct test cases is also reported.
arXiv Detail & Related papers (2020-06-14T15:01:53Z) - Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)