Test smells in LLM-Generated Unit Tests
- URL: http://arxiv.org/abs/2410.10628v1
- Date: Mon, 14 Oct 2024 15:35:44 GMT
- Title: Test smells in LLM-Generated Unit Tests
- Authors: Wendkûuni C. Ouédraogo, Yinghua Li, Kader Kaboré, Xunzhu Tang, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé,
- Abstract summary: This study explores the diffusion of test smells in Large Language Models generated unit test suites.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human written and LLM-generated test suites.
- Score: 11.517293765116307
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The use of Large Language Models (LLMs) in automated test generation is gaining popularity, with much of the research focusing on metrics like compilability rate, code coverage and bug detection. However, an equally important quality metric is the presence of test smells design flaws or anti patterns in test code that hinder maintainability and readability. In this study, we explore the diffusion of test smells in LLM generated unit test suites and compare them to those found in human written ones. We analyze a benchmark of 20,500 LLM-generated test suites produced by four models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) across five prompt engineering techniques, alongside a dataset of 780,144 human written test suites from 34,637 projects. Leveraging TsDetect, a state of the art tool capable of detecting 21 different types of test smells, we identify and analyze the prevalence and co-occurrence of various test smells in both human written and LLM-generated test suites. Our findings reveal new insights into the strengths and limitations of LLMs in test generation. First, regarding prevalence, we observe that LLMs frequently generate tests with common test smells, such as Magic Number Test and Assertion Roulette. Second, in terms of co occurrence, certain smells, like Long Test and Useless Test, tend to co occur in LLM-generated suites, influenced by specific prompt techniques. Third, we find that project complexity and LLM specific factors, including model size and context length, significantly affect the prevalence of test smells. Finally, the patterns of test smells in LLM-generated tests often mirror those in human-written tests, suggesting potential data leakage from training datasets. These insights underscore the need to refine LLM-based test generation for cleaner code and suggest improvements in both LLM capabilities and software testing practices.
Related papers
- ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z) - LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs.
LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications.
arXiv Detail & Related papers (2025-01-15T05:51:20Z) - Improving the Readability of Automatically Generated Tests using Large Language Models [7.7149881834358345]
We propose to combine the effectiveness of search-based generators with the readability of LLM generated tests.
Our approach focuses on improving test and variable names produced by search-based tools, while keeping their semantics unchanged.
arXiv Detail & Related papers (2024-12-25T09:08:53Z) - AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge [68.39683427262335]
Existing studies fail to guarantee contamination-free evaluation as newly collected data may contain pre-existing knowledge.
We propose AntiLeak-Bench, an automated anti-leakage benchmarking framework.
arXiv Detail & Related papers (2024-12-18T09:53:12Z) - SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z) - SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models.
In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit.
We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z) - Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.056044348209483]
Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints.
Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
arXiv Detail & Related papers (2024-06-28T20:38:41Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called emphTestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation.
SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2.
Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z) - Effective Test Generation Using Pre-trained Large Language Models and
Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs)
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.