Automatic Generation of Test Cases based on Bug Reports: a Feasibility
Study with Large Language Models
- URL: http://arxiv.org/abs/2310.06320v1
- Date: Tue, 10 Oct 2023 05:30:12 GMT
- Title: Automatic Generation of Test Cases based on Bug Reports: a Feasibility
Study with Large Language Models
- Authors: Laura Plein, Wendkûuni C. Ouédraogo, Jacques Klein, Tegawendé F. Bissyandé
- Abstract summary: Existing approaches produce test cases that can either be qualified as simple (e.g. unit tests) or require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
- Score: 4.318319522015101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software testing is a core discipline in software engineering where a large
array of research results has been produced, notably in the area of automatic
test generation. Because existing approaches produce test cases that can either
be qualified as simple (e.g. unit tests) or require precise
specifications, most testing procedures still rely on test cases written by
humans to form test suites. Such test suites, however, are incomplete: they
only cover parts of the project or they are produced after the bug is fixed.
Yet, several research challenges, such as automatic program repair, and
practitioner processes, build on the assumption that available test suites are
sufficient. There is thus a need to break existing barriers in automatic test
case generation. While prior work largely focused on random unit testing
inputs, we propose to consider generating test cases that realistically
represent complex user execution scenarios, which reveal buggy behaviour. Such
scenarios are informally described in bug reports, which should therefore be
considered as natural inputs for specifying bug-triggering test cases. In this
work, we investigate the feasibility of performing this generation by
leveraging large language models (LLMs) and using bug reports as inputs. Our
experiments include the use of ChatGPT, as an online service, as well as
CodeGPT, a code-related pre-trained LLM that was fine-tuned for our task.
Overall, we experimentally show that bug reports associated with up to 50% of
Defects4J bugs can prompt ChatGPT to generate an executable test case. We show
that even new bug reports can indeed be used as input for generating executable
test cases. Finally, we report experimental results which confirm that
LLM-generated test cases are immediately useful in software engineering tasks
such as fault localization as well as patch validation in automated program
repair.
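To make the studied pipeline concrete, the following minimal sketch prompts an LLM with a bug report and requests an executable JUnit test. The client library (openai), model name, and prompt wording are illustrative assumptions, not the authors' exact setup.
```python
# Minimal sketch of the idea studied in the paper: feed a bug report to an LLM
# and ask for an executable JUnit test that reproduces the reported failure.
# The openai client, model name, and prompt wording are assumptions for
# illustration, not the authors' exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def test_from_bug_report(bug_report: str, class_under_test: str) -> str:
    """Ask the LLM for a JUnit test method that reproduces the reported bug."""
    prompt = (
        "You are given a bug report for a Java project.\n\n"
        f"Bug report:\n{bug_report}\n\n"
        f"Write a single JUnit 4 test for class {class_under_test} that fails "
        "on the buggy version and passes once the bug is fixed. "
        "Return only compilable Java code."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; the paper used ChatGPT and a fine-tuned CodeGPT
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# The returned code would then be compiled and executed against the buggy
# revision (e.g. a Defects4J checkout) to check that it is executable and
# actually bug-revealing.
```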
Related papers
- ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z)
- Design choices made by LLM-based test generators prevent them from finding bugs [0.850206009406913]
This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code.
Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests.
arXiv Detail & Related papers (2024-12-18T18:33:26Z)
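To make the failure mode described above concrete, here is a toy, self-contained Python example (the discount function and its values are hypothetical): a generator that derives its oracle from the code's current output asserts the buggy value and so validates the bug, while a specification-based test would reveal it.
```python
# Hypothetical buggy function used only to illustrate the failure mode above.
def apply_discount(price: float, percent: float) -> float:
    # Bug: divides by 10 instead of 100, so a 10% discount removes the full price.
    return price - price * percent / 10


# Oracle captured from the *current* (buggy) output, as regression-oriented
# generators tend to do: the test passes and thereby validates the bug.
def test_apply_discount_regression_style():
    assert apply_discount(100.0, 10.0) == 0.0


# Oracle derived from the intended behaviour: this test fails on the buggy
# code and is exactly the kind of bug-revealing test such tools may reject.
def test_apply_discount_specification_style():
    assert apply_discount(100.0, 10.0) == 90.0
```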
- System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT [1.9282110216621835]
This paper explores the effectiveness of using Large Language Models (LLMs) to generate test case designs from Software Requirements Specification (SRS) documents.
About 87 percent of the generated test cases were valid, with the remaining 13 percent either not applicable or redundant.
arXiv Detail & Related papers (2024-12-04T20:12:27Z)
- Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Automated Test Case Repair Using Language Models [0.5708902722746041]
Unrepaired broken test cases can degrade test suite quality and disrupt the software development process.
We present TaRGET, a novel approach leveraging pre-trained code language models for automated test case repair.
TaRGET treats test repair as a language translation task, employing a two-step process to fine-tune a language model.
arXiv Detail & Related papers (2024-01-12T18:56:57Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
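A minimal sketch of the mutation-guided loop behind approaches like MuTAP, assuming caller-supplied helpers for the LLM call and for checking whether a test suite kills a mutant; MuTAP's actual prompts and tooling differ.
```python
# Sketch of a mutation-guided prompt-augmentation loop (assumed helpers; the
# real MuTAP pipeline and prompts differ).
from typing import Callable


def mutation_guided_tests(
    generate: Callable[[str], str],     # assumed LLM call: prompt -> test code
    kills: Callable[[str, str], bool],  # assumed runner: (tests, mutant) -> mutant detected?
    mutants: list[str],
    initial_prompt: str,
) -> str:
    """Re-prompt the LLM with every surviving mutant to strengthen the tests."""
    tests = generate(initial_prompt)
    for mutant in mutants:
        if not kills(tests, mutant):  # the current suite misses this mutant
            augmented = (
                f"{initial_prompt}\n\nThe following mutated version of the "
                f"program is not detected by the current tests:\n{mutant}\n"
                "Add a test case whose assertions fail on this mutant."
            )
            tests = generate(augmented)
    return tests
```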
- Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
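A toy Python sketch of the state-comparison idea: record the object state reached during a test run and diff it against a snapshot taken on the previous version of the SUT. The snapshot format and helper names are illustrative; the approach itself instruments Java object state.
```python
# Toy state-snapshot oracle (illustrative names and JSON format; the approach
# summarised above monitors Java object state during test execution).
import json


def snapshot(obj) -> dict:
    """Record the public, non-callable attributes of an object."""
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if not name.startswith("_") and not callable(getattr(obj, name))
    }


def diff_against_previous(obj, snapshot_file: str) -> dict:
    """Return attribute-level differences w.r.t. the stored snapshot."""
    current = snapshot(obj)
    with open(snapshot_file) as fh:
        previous = json.load(fh)
    return {
        key: (previous.get(key), current.get(key))
        for key in set(previous) | set(current)
        if previous.get(key) != current.get(key)
    }


# A non-empty diff flags a behavioural change that the developer must classify
# as intended (refresh the snapshot) or as a regression (keep the failing oracle).
```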
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
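A minimal sketch of the selection step summarised above: execute every candidate solution against the generated tests and keep the highest-scoring one. CodeT's full dual execution agreement additionally clusters solutions by their output behaviour; the test-runner callable here is an assumed placeholder.
```python
# Sketch of test-based candidate selection (the runner is an assumed
# placeholder; CodeT also clusters solutions by output agreement).
from typing import Callable, Sequence


def pick_best_solution(
    solutions: Sequence[str],
    tests: Sequence[str],
    passes: Callable[[str, str], bool],  # assumed: does `test` pass against `solution`?
) -> str:
    """Return the candidate solution that passes the most generated tests."""
    scores = [sum(passes(solution, test) for test in tests) for solution in solutions]
    best_index = max(range(len(solutions)), key=lambda i: scores[i])
    return solutions[best_index]
```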
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
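As a library-free illustration of one cell of the CheckList matrix, the toy invariance (INV) test below swaps a name slot in a template and flags any input whose prediction changes; the model callable, template, and name list are hypothetical.
```python
# Toy invariance (INV) test in the spirit of CheckList: swapping a person's
# name should not change a sentiment model's prediction. The model callable,
# template, and names are hypothetical placeholders.
from typing import Callable

NAMES = ["Alice", "Bob", "Priya", "Chen"]


def invariance_test(predict: Callable[[str], str], template: str) -> list[str]:
    """Return the perturbed inputs whose prediction differs from the first one."""
    sentences = [template.format(name=name) for name in NAMES]
    predictions = [predict(sentence) for sentence in sentences]
    return [s for s, p in zip(sentences, predictions) if p != predictions[0]]


# Example: invariance_test(model, "{name} said the flight was great.")
# A non-empty result exposes a name-sensitivity bug even when aggregate
# accuracy looks fine.
```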
This list is automatically generated from the titles and abstracts of the papers on this site.