Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
- URL: http://arxiv.org/abs/2506.07594v1
- Date: Mon, 09 Jun 2025 09:46:41 GMT
- Title: Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
- Authors: E. G. Santana Jr, Jander Pereira Santos Junior, Erlon P. Almeida, Iftekhar Ahmed, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida
- Abstract summary: Test smells indicate poor development practices in test code, reducing maintainability and reliability. We evaluated GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test suites.
- Score: 6.373038973241454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test smells indicate poor development practices in test code, reducing maintainability and reliability. While developers often struggle to prevent or refactor these issues, existing tools focus primarily on detection rather than automated refactoring. Large Language Models (LLMs) have shown strong potential in code understanding and transformation, but their ability to both identify and refactor test smells remains underexplored. We evaluated GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test suites, using PyNose and TsDetect for initial smell detection, followed by LLM-driven refactoring. Gemini achieved the highest detection accuracy (74.35% Python, 80.32% Java), while LLaMA was lowest. All models could refactor smells, but effectiveness varied, sometimes introducing new smells. Gemini also improved test coverage, unlike GPT-4 and LLaMA, which often reduced it. These results highlight LLMs' potential for automated test smell refactoring, with Gemini as the strongest performer, though challenges remain across languages and smell types.
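The workflow described in the abstract, tool-based detection followed by LLM-driven refactoring, can be illustrated with a minimal sketch. The snippet below is not taken from the paper: the `detect_smells` and `call_llm` helpers and the prompt wording are hypothetical stand-ins for PyNose/TsDetect output and an API call to one of the evaluated models.

```python
# Minimal sketch of the detect-then-refactor loop described in the abstract.
# `detect_smells` and `call_llm` are hypothetical placeholders, not the
# paper's actual tooling: in the study, PyNose (Python) and TsDetect (Java)
# produce the smell reports, and the prompt is sent to GPT-4-Turbo,
# LLaMA 3 70B, or Gemini-1.5 Pro.

from pathlib import Path


def detect_smells(test_file: Path) -> list[str]:
    """Placeholder for a PyNose/TsDetect-style detector.

    Would return smell labels such as 'Assertion Roulette' or 'Magic Number Test'.
    """
    raise NotImplementedError("wire this to a real smell-detection tool")


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to the evaluated model."""
    raise NotImplementedError("wire this to an LLM API client")


def refactor_test_file(test_file: Path) -> str:
    """Build a refactoring prompt from the detected smells and the test source."""
    smells = detect_smells(test_file)
    source = test_file.read_text()
    prompt = (
        "The following test file contains these test smells: "
        f"{', '.join(smells)}.\n"
        "Refactor the tests to remove the smells while preserving behavior.\n\n"
        f"{source}"
    )
    return call_llm(prompt)
```

In the study itself, the refactored output is then re-checked with the detection tools (and test coverage is measured) to see whether smells were actually removed or new ones introduced.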
Related papers
- Quality Assessment of Python Tests Generated by Large Language Models [1.0845500038686533]
This study investigates the quality of Python test code generated by three Large Language Models: GPT-4o, Amazon Q, and LLama 3.3. We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C).
arXiv Detail & Related papers (2025-06-17T08:16:15Z)
- Agentic SLMs: Hunting Down Test Smells [4.5274260758457645]
Test smells can compromise the reliability of test suites and hinder software maintenance. This study evaluates LLAMA 3.2 3B, GEMMA 2 9B, DEEPSEEK-R1 14B, and PHI 4 14B - small, open language models. We explore configurations with one, two, and four agents across 150 instances of 5 common test smell types extracted from real-world Java projects.
arXiv Detail & Related papers (2025-04-09T21:12:01Z)
- Evaluating the Effectiveness of Small Language Models in Detecting Refactoring Bugs [0.6133301815445301]
This study evaluates the effectiveness of Small Language Models (SLMs) in detecting two types of refactoring bugs in Java and Python. The study covers 16 refactoring types and employs zero-shot prompting on consumer-grade hardware to evaluate the models' ability to reason about correctness without explicit prior training. The proprietary o3-mini-high model achieved the highest detection rate, identifying 84.3% of Type I bugs.
arXiv Detail & Related papers (2025-02-25T18:52:28Z)
- Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending idiomatic refactoring actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
- Test smells in LLM-Generated Unit Tests [11.517293765116307]
This study explores the diffusion of test smells in unit test suites generated by Large Language Models.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human written and LLM-generated test suites.
arXiv Detail & Related papers (2024-10-14T15:35:44Z)
- Automated Unit Test Refactoring [10.847400457238423]
Test smells arise from poor design practices and insufficient domain knowledge. We propose UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects. We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction.
arXiv Detail & Related papers (2024-09-25T08:42:29Z)
- Evaluating Large Language Models in Detecting Test Smells [1.5691664836504473]
The presence of test smells can negatively impact the maintainability and reliability of software.
This study aims to evaluate the capability of Large Language Models (LLMs) in automatically detecting test smells (an illustrative example of such a smell is sketched after this list).
arXiv Detail & Related papers (2024-07-27T14:00:05Z)
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- A Comprehensive Survey of Contamination Detection Methods in Large Language Models [68.10605098856087]
With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges. LLMs' performance may no longer be reliable, as their high scores may be at least partly due to previous exposure to the evaluation data. This limitation jeopardizes real capability improvement in the field of NLP, yet there remains a lack of methods to efficiently detect contamination.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes [61.916827858666906]
Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer.
To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback.
Recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails.
This paper proposes a method called Gradient Cuff to detect jailbreak attempts.
arXiv Detail & Related papers (2024-03-01T03:29:54Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
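Several entries above discuss "test smells" without showing one. As a purely illustrative, hypothetical example (the `Cart` class and test values are invented, not drawn from any of the papers), the sketch below contrasts a smelly Python test exhibiting Assertion Roulette and Magic Number Test with the kind of refactored version an LLM would typically be prompted to produce.

```python
import unittest


class Cart:
    """Tiny illustrative class; not from any of the papers above."""

    def __init__(self):
        self._items = {}

    def add(self, name, price):
        self._items[name] = price

    def count(self):
        return len(self._items)

    def total(self):
        return sum(self._items.values())

    def contains(self, name):
        return name in self._items


class SmellyCartTest(unittest.TestCase):
    # Assertion Roulette: several unexplained assertions in one test,
    # plus Magic Number Test: unexplained numeric literals.
    def test_cart(self):
        cart = Cart()
        cart.add("book", 12.5)
        cart.add("pen", 2.5)
        self.assertEqual(cart.count(), 2)
        self.assertEqual(cart.total(), 15.0)
        self.assertTrue(cart.contains("book"))


class RefactoredCartTest(unittest.TestCase):
    # Refactored style a smell-removal prompt typically asks for:
    # one behavior per test, named constants, and assertion messages.
    BOOK_PRICE = 12.5
    PEN_PRICE = 2.5

    def test_total_sums_item_prices(self):
        cart = Cart()
        cart.add("book", self.BOOK_PRICE)
        cart.add("pen", self.PEN_PRICE)
        self.assertEqual(
            cart.total(),
            self.BOOK_PRICE + self.PEN_PRICE,
            "total should equal the sum of item prices",
        )

    def test_added_item_is_present(self):
        cart = Cart()
        cart.add("book", self.BOOK_PRICE)
        self.assertTrue(cart.contains("book"), "added item should be in the cart")


if __name__ == "__main__":
    unittest.main()
```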
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.