Unit Test Generation using Generative AI : A Comparative Performance
Analysis of Autogeneration Tools
- URL: http://arxiv.org/abs/2312.10622v2
- Date: Tue, 13 Feb 2024 15:18:29 GMT
- Title: Unit Test Generation using Generative AI : A Comparative Performance
Analysis of Autogeneration Tools
- Authors: Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, Pankaj Jalote
- Abstract summary: This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs.
For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code.
Our results show that ChatGPT's coverage is comparable to Pynguin's, and in some cases superior.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating unit tests is a crucial task in software development, demanding
substantial time and effort from programmers. The advent of Large Language
Models (LLMs) introduces a novel avenue for unit test script generation. This
research aims to experimentally investigate the effectiveness of LLMs,
specifically exemplified by ChatGPT, for generating unit test scripts for
Python programs, and how the generated test cases compare with those generated
by an existing unit test generator (Pynguin). For experiments, we consider
three types of code units: 1) Procedural scripts, 2) Function-based modular
code, and 3) Class-based code. The generated test cases are evaluated based on
criteria such as coverage, correctness, and readability. Our results show that
ChatGPT's coverage is comparable to Pynguin's, and in some cases superior. We
also find that about a third of the assertions generated by ChatGPT for some
categories were incorrect. Our results also show minimal overlap in missed
statements between ChatGPT and Pynguin, suggesting that a combination of both
tools may enhance unit test generation performance. Finally, in our
experiments, prompt engineering improved ChatGPT's performance, achieving
much higher coverage.
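The abstract's point about minimal overlap in missed statements can be illustrated with a small sketch. It is not the authors' evaluation code: the function name `compare_missed_statements`, the statement numbers, and the overlap measure (a Jaccard ratio over missed statements) are all assumptions chosen for illustration. The idea is that a statement stays uncovered by the combined test suites only if both tools miss it.

```python
def compare_missed_statements(total_statements, missed_a, missed_b):
    """Compare statements left uncovered by two test suites (hypothetical helper)."""
    if total_statements <= 0:
        raise ValueError("total_statements must be positive")

    missed_a, missed_b = set(missed_a), set(missed_b)
    missed_by_both = missed_a & missed_b      # still uncovered after combining
    missed_by_either = missed_a | missed_b    # uncovered by at least one suite

    def coverage(missed):
        # Statement coverage as a percentage of all statements.
        return 100.0 * (total_statements - len(missed)) / total_statements

    # Jaccard ratio of the two missed sets: 0.0 means no shared misses.
    overlap = (
        len(missed_by_both) / len(missed_by_either) if missed_by_either else 0.0
    )
    return {
        "coverage_a": coverage(missed_a),
        "coverage_b": coverage(missed_b),
        "coverage_combined": coverage(missed_by_both),
        "missed_overlap": overlap,
    }


# Made-up line numbers for a 20-statement module, not data from the paper.
report = compare_missed_statements(20, {3, 4, 11}, {11, 17, 18, 19})
```

With these made-up inputs the two suites share a single missed statement, so the combined coverage is higher than either suite alone, which is the situation the abstract describes as an argument for using both tools together.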
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- ViUniT: Visual Unit Tests for More Robust Visual Programming [104.55763189099125]
When models answer correctly, they produce incorrect programs 33% of the time.
We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests.
arXiv Detail & Related papers (2024-12-12T01:36:18Z)
- ASTER: Natural and Multi-language Unit Test Generation with LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases.
We conduct an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness.
arXiv Detail & Related papers (2024-09-04T21:46:18Z)
- Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
- Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z)
- Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions [0.05657375260432172]
Unit testing is a commonly used approach in software engineering to test the correctness and robustness of written code.
In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter.
We find that the quality of the generated unit tests is not sensitive to changes in minor details in the prompts provided.
arXiv Detail & Related papers (2023-09-30T20:36:23Z)
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation [11.009117714870527]
Unit testing is essential in detecting bugs in functionally-discrete program units.
Recent work has shown the large potential of large language models (LLMs) in unit test generation.
It remains unclear how effective ChatGPT is in unit test generation.
arXiv Detail & Related papers (2023-05-07T07:17:08Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)