TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration
- URL: http://arxiv.org/abs/2408.03095v5
- Date: Sat, 21 Dec 2024 12:51:04 GMT
- Title: TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration
- Authors: Siqi Gu, Quanjun Zhang, Chunrong Fang, Fangyuan Tian, Liuchuan Zhu, Jianyi Zhou, Zhenyu Chen,
- Abstract summary: Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.
We propose TestART, a novel unit test generation method.
TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
- Score: 7.833381226332574
- License:
- Abstract: Unit testing is crucial for detecting bugs in individual program units but consumes time and effort. Recently, large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. However, several problems limit their ability to generate high-quality unit test cases: (1) compilation and runtime errors caused by the hallucination of LLMs; (2) lack of testing and coverage feedback information restricting the increase of code coverage;(3) the repetitive suppression problem causing invalid LLM-based repair and generation attempts. To address these limitations, we propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration, representing a significant advancement in automated unit test generation. TestART leverages the template-based repair strategy to effectively fix bugs in LLM-generated test cases for the first time. Meanwhile, TestART extracts coverage information from successful test cases and uses it as coverage-guided testing feedback. It also incorporates positive prompt injection to prevent repetition suppression, thereby enhancing the sufficiency of the final test case. This synergy between generation and repair elevates the correctness and sufficiency of the produced test cases significantly beyond previous methods. In comparative experiments, TestART demonstrates an 18% improvement in pass rate and a 20% enhancement in coverage across three types of datasets compared to baseline models. Additionally, it achieves better coverage rates than EvoSuite with only half the number of test cases. These results demonstrate TestART's superior ability to produce high-quality unit test cases by harnessing the power of LLMs while overcoming their inherent flaws.
Related papers
- ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z) - Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM)
We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.
We show that UTGen outperforms UT generation baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z) - AugmenTest: Enhancing Tests with LLM-Driven Oracles [2.159639193866661]
AugmenTest is an approach leveraging Large Language Models to infer correct test oracles based on available documentation of the software under test.
AugmenTest includes four variants: Simple Prompt, Extended Prompt, RAG with a generic prompt (without the context of class or method under test), and RAG with Simple Prompt, each offering different levels of contextual information to the LLMs.
Results show that in the most conservative scenario, AugmenTest's Extended Prompt consistently outperformed the Simple Prompt, achieving a success rate of 30% for generating correct assertions.
arXiv Detail & Related papers (2025-01-29T07:45:41Z) - LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs.
LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications.
arXiv Detail & Related papers (2025-01-15T05:51:20Z) - Dynamic Scaling of Unit Tests for Code Reward Modeling [27.349232888627558]
Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation.
We propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling.
arXiv Detail & Related papers (2025-01-02T04:33:31Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called emphTestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z) - Effective Test Generation Using Pre-trained Large Language Models and
Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs)
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z) - Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.