Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation
- URL: http://arxiv.org/abs/2407.00225v1
- Date: Fri, 28 Jun 2024 20:38:41 GMT
- Title: Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation
- Authors: Wendkûuni C. Ouédraogo, Kader Kaboré, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
- Abstract summary: Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints.
Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
- Score: 11.056044348209483
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints. Automated test generation techniques have emerged to address this, but often lack readability and require developer intervention. Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation. However, their effectiveness remains unclear. This study conducts the first comprehensive investigation of LLMs, evaluating the effectiveness of four LLMs and five prompt engineering techniques for unit test generation. We analyze 216,300 tests generated by the selected advanced instruct-tuned LLMs for 690 Java classes collected from diverse datasets. We assess the correctness, understandability, coverage, and bug detection capabilities of LLM-generated tests, comparing them to EvoSuite, a popular automated testing tool. While LLMs show potential, improvements in test correctness are necessary. This study reveals the strengths and limitations of LLMs compared to traditional methods, paving the way for further research on LLMs in software engineering.
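To make the evaluated artifacts concrete, here is a hedged sketch of the kind of JUnit test an instruct-tuned LLM is typically prompted to produce for a small Java class. The BankAccount class and the test names are hypothetical illustrations, not items from the study's 690-class dataset.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

// Hypothetical class under test, included so the sketch is self-contained;
// it stands in for the Java classes the study collects from real projects.
class BankAccount {
    private double balance;

    BankAccount(double openingBalance) {
        this.balance = openingBalance;
    }

    void deposit(double amount) {
        balance += amount;
    }

    void withdraw(double amount) {
        if (amount > balance) {
            throw new IllegalArgumentException("insufficient funds");
        }
        balance -= amount;
    }

    double getBalance() {
        return balance;
    }
}

// Illustrative LLM-style test: readable names, explicit assertion messages,
// and one test that exercises an error path (the bug-detection dimension).
class BankAccountTest {

    @Test
    void depositIncreasesBalance() {
        BankAccount account = new BankAccount(100.0);
        account.deposit(50.0);
        assertEquals(150.0, account.getBalance(), 1e-9,
                "balance should grow by the deposited amount");
    }

    @Test
    void withdrawingMoreThanBalanceThrows() {
        BankAccount account = new BankAccount(20.0);
        assertThrows(IllegalArgumentException.class, () -> account.withdraw(100.0));
    }
}
```

Roughly, correctness concerns whether such a test compiles and passes, understandability concerns naming and assertion clarity, and coverage and bug detection are measured by executing the suite; the same dimensions are used to compare against EvoSuite-generated suites.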
Related papers
- Test smells in LLM-Generated Unit Tests [11.517293765116307]
This study explores the diffusion of test smells in unit test suites generated by Large Language Models.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human-written and LLM-generated test suites; a minimal illustration of such smells follows this entry.
arXiv Detail & Related papers (2024-10-14T15:35:44Z)
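As a concrete illustration of what this line of work looks for, the hedged sketch below shows two smells that commonly surface in generated suites, a redundant print statement and assertion roulette. The test is a made-up example written against java.util.Stack so it is self-contained, not one drawn from the paper's 20,500-suite benchmark.

```java
import java.util.Stack;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

// Hypothetical "smelly" generated test, not taken from the paper's benchmark.
class StackSmellyTest {

    @Test
    void testStack() {
        Stack<Integer> stack = new Stack<>();
        stack.push(1);
        stack.push(2);

        // Smell: redundant print statement -- console output that no assertion checks
        System.out.println(stack);

        // Smell: assertion roulette -- several assertions with no explanatory messages,
        // so a failure report does not say which expectation was violated
        assertEquals(2, stack.size());
        assertFalse(stack.isEmpty());
        assertEquals(2, stack.pop().intValue());
        assertEquals(1, stack.pop().intValue());
    }
}
```

Such patterns are typically flagged by static analysis over the test code, for example by counting assertions without messages per test method.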
- TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models [8.22619177301814]
We introduce TestBench, a benchmark for class-level LLM-based test case generation.
We construct a dataset of 108 Java programs from 9 real-world, large-scale projects on GitHub.
We propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate; a minimal sketch of a compilation-correctness check follows this entry.
arXiv Detail & Related papers (2024-09-26T06:18:06Z)
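To make the compilation-correctness aspect concrete: one plausible way to check it, assumed here for illustration and not taken from the TestBench paper, is to run the JDK's built-in compiler API over each generated test file and record the exit code.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Minimal sketch: check "compilation correctness" of a generated test file.
// This is an assumed, simplified harness, not the framework described in the paper.
public class CompileCheck {

    /** Returns true if the given .java source file compiles against the given classpath. */
    static boolean compiles(String javaFilePath, String classpath) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // requires a JDK, not a bare JRE
        // run(in, out, err, args...) returns 0 on success, nonzero on compilation errors
        int exitCode = compiler.run(null, null, null, "-cp", classpath, javaFilePath);
        return exitCode == 0;
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only
        boolean ok = compiles("generated/BankAccountTest.java", "build/classes:libs/junit-jupiter.jar");
        System.out.println(ok ? "compilation correctness: pass" : "compilation correctness: fail");
    }
}
```

Syntactic correctness can be checked even earlier with a parser, while test correctness, coverage, and defect detection require actually executing the compiled tests against reference and faulty versions of the class under test.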
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs in incorrect code, comprising three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- On the Evaluation of Large Language Models in Unit Test Generation [16.447000441006814]
Unit testing is an essential activity in software development for verifying the correctness of software components.
The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation.
arXiv Detail & Related papers (2024-06-26T08:57:03Z)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, yielding a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also introducing a framework based on the roofline model.
This framework identifies the bottlenecks in deploying LLMs on hardware devices and provides a clear understanding of practical problems; a toy roofline calculation follows this entry.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
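For readers unfamiliar with the term, the roofline model bounds attainable throughput by the minimum of peak compute and memory bandwidth times arithmetic intensity. The toy Java calculation below uses made-up hardware numbers, not figures from the survey, purely to show the idea.

```java
// Toy roofline calculation: attainable performance is capped either by peak compute
// or by memory bandwidth times arithmetic intensity (FLOPs per byte moved).
// The hardware numbers below are illustrative placeholders, not values from the survey.
public class RooflineSketch {

    static double attainableGflops(double peakGflops, double bandwidthGBps, double flopsPerByte) {
        return Math.min(peakGflops, bandwidthGBps * flopsPerByte);
    }

    public static void main(String[] args) {
        double peak = 300_000;    // peak compute, GFLOP/s (placeholder)
        double bandwidth = 2_000; // memory bandwidth, GB/s (placeholder)

        // Low arithmetic intensity (e.g., memory-bound decoding at small batch sizes)
        System.out.println(attainableGflops(peak, bandwidth, 1.0));   // bandwidth-bound: 2000

        // High arithmetic intensity (e.g., large-batch prefill)
        System.out.println(attainableGflops(peak, bandwidth, 500.0)); // compute-bound: 300000
    }
}
```

Plotting this bound against a workload's arithmetic intensity is what lets the survey identify whether a given deployment is memory-bound or compute-bound.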
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model with far fewer parameters and take its answers as heuristic answers.
These heuristic answers are then used to predict the knowledge required to answer the user question, as well as the knowledge that is known or unknown to the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
- Software Testing with Large Language Models: Survey, Landscape, and Vision [32.34617250991638]
Pre-trained large language models (LLMs) have emerged as a breakthrough technology in natural language processing and artificial intelligence.
This paper provides a comprehensive review of the utilization of LLMs in software testing.
arXiv Detail & Related papers (2023-07-14T08:26:12Z)
- ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation [25.200080365022153]
We present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite.
Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug detection capability.
arXiv Detail & Related papers (2023-07-02T15:09:40Z)
- Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.