Unit Test Generation using Generative AI : A Comparative Performance
Analysis of Autogeneration Tools
- URL: http://arxiv.org/abs/2312.10622v2
- Date: Tue, 13 Feb 2024 15:18:29 GMT
- Title: Unit Test Generation using Generative AI : A Comparative Performance
Analysis of Autogeneration Tools
- Authors: Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, Pankaj Jalote
- Abstract summary: This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs.
For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code.
Our results show that ChatGPT's performance is comparable with Pynguin in terms of coverage, though for some cases its performance is superior to Pynguin.
- Score: 2.0686733932673604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating unit tests is a crucial task in software development, demanding
substantial time and effort from programmers. The advent of Large Language
Models (LLMs) introduces a novel avenue for unit test script generation. This
research aims to experimentally investigate the effectiveness of LLMs,
specifically exemplified by ChatGPT, for generating unit test scripts for
Python programs, and how the generated test cases compare with those generated
by an existing unit test generator (Pynguin). For experiments, we consider
three types of code units: 1) Procedural scripts, 2) Function-based modular
code, and 3) Class-based code. The generated test cases are evaluated based on
criteria such as coverage, correctness, and readability. Our results show that
ChatGPT's performance is comparable with Pynguin in terms of coverage, though
for some cases its performance is superior to Pynguin. We also find that about
a third of assertions generated by ChatGPT for some categories were incorrect.
Our results also show that there is minimal overlap in missed statements
between ChatGPT and Pynguin, thus, suggesting that a combination of both tools
may enhance unit test generation performance. Finally, in our experiments,
prompt engineering improved ChatGPT's performance, achieving a much higher
coverage.
Related papers
- Multi-language Unit Test Generation using LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases.
We show how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking.
Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved.
arXiv Detail & Related papers (2024-09-04T21:46:18Z) - Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called emphTestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - Prompting Code Interpreter to Write Better Unit Tests on Quixbugs
Functions [0.05657375260432172]
Unit testing is a commonly-used approach in software engineering to test the correctness and robustness of written code.
In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter.
We find that the quality of the generated unit tests is not sensitive to changes in minor details in the prompts provided.
arXiv Detail & Related papers (2023-09-30T20:36:23Z) - FacTool: Factuality Detection in Generative AI -- A Tool Augmented
Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z) - No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation [11.009117714870527]
Unit testing is essential in detecting bugs in functionally-discrete program units.
Recent work has shown the large potential of large language models (LLMs) in unit test generation.
It remains unclear how effective ChatGPT is in unit test generation.
arXiv Detail & Related papers (2023-05-07T07:17:08Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
bias testing is currently cumbersome since the test sentences are generated from a limited set of manual templates or need expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.