Improving LLM-based Unit test generation via Template-based Repair
- URL: http://arxiv.org/abs/2408.03095v4
- Date: Tue, 5 Nov 2024 12:57:35 GMT
- Title: Improving LLM-based Unit test generation via Template-based Repair
- Authors: Siqi Gu, Chunrong Fang, Quanjun Zhang, Fangyuan Tian, Jianyi Zhou, Zhenyu Chen,
- Abstract summary: Unit test is crucial for detecting bugs in individual program units but consumes time and effort.
Large language models (LLMs) have demonstrated remarkable reasoning and generation capabilities.
In this paper, we propose TestART, a novel unit test generation method.
- Score: 8.22619177301814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unit test is crucial for detecting bugs in individual program units but consumes time and effort. The existing automated unit test generation methods are mainly based on search-based software testing (SBST) and language models to liberate developers. Recently, large language models (LLMs) have demonstrated remarkable reasoning and generation capabilities. However, several problems limit their ability to generate high-quality test cases: (1) LLMs may generate invalid test cases under insufficient context, resulting in compilation errors; (2) Lack of test and coverage feedback information may cause runtime errors and low coverage rates. (3) The repetitive suppression problem causes LLMs to get stuck into the repetition loop of self-repair or re-generation attempts. In this paper, we propose TestART, a novel unit test generation method that leverages the strengths of LLMs while overcoming the limitations mentioned. TestART improves LLM-based unit test via co-evolution of automated generation and repair iteration. TestART leverages the template-based repair technique to fix bugs in LLM-generated test cases, using prompt injection to guide the next-step automated generation and avoid repetition suppression. Furthermore, TestART extracts coverage information from the passed test cases and utilizes it as testing feedback to enhance the sufficiency of the final test case. This synergy between generation and repair elevates the quality, effectiveness, and readability of the produced test cases significantly beyond previous methods. In comparative experiments, the pass rate of TestART-generated test cases is 78.55%, which is approximately 18% higher than both the ChatGPT-4.0 model and the same ChatGPT-3.5-based method ChatUniTest. It also achieves an impressive line coverage rate of 90.96% on the focal methods that passed the test, exceeding EvoSuite by 3.4%.
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z) - Test smells in LLM-Generated Unit Tests [11.517293765116307]
This study explores the diffusion of test smells in Large Language Models generated unit test suites.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human written and LLM-generated test suites.
arXiv Detail & Related papers (2024-10-14T15:35:44Z) - Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called emphTestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z) - Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation.
SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2.
Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z) - Effective Test Generation Using Pre-trained Large Language Models and
Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs)
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z) - Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z) - No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation [11.009117714870527]
Unit testing is essential in detecting bugs in functionally-discrete program units.
Recent work has shown the large potential of large language models (LLMs) in unit test generation.
It remains unclear how effective ChatGPT is in unit test generation.
arXiv Detail & Related papers (2023-05-07T07:17:08Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.