Evaluating Large Language Models in Detecting Test Smells
- URL: http://arxiv.org/abs/2407.19261v2
- Date: Tue, 30 Jul 2024 12:16:54 GMT
- Title: Evaluating Large Language Models in Detecting Test Smells
- Authors: Keila Lucas, Rohit Gheyi, Elvys Soares, Márcio Ribeiro, Ivan Machado
- Abstract summary: The presence of test smells can negatively impact the maintainability and reliability of software.
This study aims to evaluate the capability of Large Language Models (LLMs) in automatically detecting test smells.
- Score: 1.5691664836504473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test smells are coding issues that typically arise from inadequate practices, a lack of knowledge about effective testing, or deadline pressures to complete projects. The presence of test smells can negatively impact the maintainability and reliability of software. While there are tools that use advanced static analysis or machine learning techniques to detect test smells, these tools often require considerable effort to use. This study aims to evaluate the capability of Large Language Models (LLMs) in automatically detecting test smells. We evaluated ChatGPT-4, Mistral Large, and Gemini Advanced using 30 types of test smells across codebases in seven different programming languages collected from the literature. ChatGPT-4 identified 21 types of test smells, Gemini Advanced identified 17 types, and Mistral Large detected 15 types. Conclusion: the LLMs demonstrated potential as a valuable tool for identifying test smells.
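As a concrete illustration of what such an LLM-based check involves, the sketch below builds a detection prompt around a small Python test that exhibits two well-known smells. It is an assumption-laden example, not the authors' protocol: the paper evaluated ChatGPT-4, Mistral Large, and Gemini Advanced, whereas this sketch assumes the OpenAI Python SDK, and the prompt wording, smell names, and sample test are illustrative choices.

```python
# Minimal sketch of prompting an LLM to report test smells. The OpenAI SDK,
# the prompt wording, and the example test are illustrative assumptions, not
# the setup used in the paper.
from openai import OpenAI

# Example unit test exhibiting two classic smells: Assertion Roulette
# (several assertions with no failure messages) and Magic Number Test
# (unexplained numeric literals).
SMELLY_TEST = '''
def test_order_total():
    order = Order()
    order.add_item("book", 2, 15.0)
    assert order.subtotal() == 30.0
    assert order.tax() == 2.4
    assert order.total() == 32.4
'''

PROMPT_TEMPLATE = (
    "You are a software testing expert. Analyze the following test code and "
    "list every test smell it contains (e.g., Assertion Roulette, Magic Number "
    "Test, Eager Test), naming the smell and the lines involved:\n\n{code}"
)


def detect_test_smells(test_code: str, model: str = "gpt-4") -> str:
    """Ask an LLM which test smells are present in the given test code."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(code=test_code)}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(detect_test_smells(SMELLY_TEST))
```

On this input, a capable model would be expected to flag Assertion Roulette (several assertions with no explanatory messages) and Magic Number Test (unexplained numeric literals), two smells commonly catalogued in the test-smell literature.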
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
- Test smells in LLM-Generated Unit Tests [11.517293765116307]
This study explores the diffusion of test smells in unit test suites generated by Large Language Models.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human written and LLM-generated test suites.
arXiv Detail & Related papers (2024-10-14T15:35:44Z)
- xNose: A Test Smell Detector for C# [0.0]
Test smells, similar to code smells, can negatively impact both the test code and the production code being tested.
Despite extensive research on test smells in languages like Java, Scala, and Python, automated tools for detecting test smells in C# are lacking.
arXiv Detail & Related papers (2024-05-07T07:10:42Z)
- A Catalog of Transformations to Remove Smells From Natural Language Tests [1.260984934917191]
Test smells can pose difficulties during testing activities, such as poor maintainability, non-deterministic behavior, and incomplete verification.
This paper introduces a catalog of transformations designed to remove seven natural language test smells and a companion tool implemented using Natural Language Processing (NLP) techniques.
arXiv Detail & Related papers (2024-04-25T19:23:24Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- Manual Tests Do Smell! Cataloging and Identifying Natural Language Test Smells [1.43994708364763]
Test smells indicate potential problems in the design and implementation of automated software tests.
This study aims to contribute to a catalog of test smells for manual tests.
arXiv Detail & Related papers (2023-08-02T19:05:36Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Machine Learning-Based Test Smell Detection [17.957877801382413]
Test smells are symptoms of sub-optimal design choices adopted when developing test cases.
We propose the design and experimentation of a novel test smell detection approach based on machine learning to detect four test smells.
arXiv Detail & Related papers (2022-08-16T07:33:15Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.