Agentic SLMs: Hunting Down Test Smells
- URL: http://arxiv.org/abs/2504.07277v1
- Date: Wed, 09 Apr 2025 21:12:01 GMT
- Title: Agentic SLMs: Hunting Down Test Smells
- Authors: Rian Melo, Pedro Simões, Rohit Gheyi, Marcelo d'Amorim, Márcio Ribeiro, Gustavo Soares, Eduardo Almeida, Elvys Soares
- Abstract summary: Test smells can compromise the reliability of test suites and hinder software maintenance. This study evaluates LLAMA 3.2 3B, GEMMA 2 9B, DEEPSEEK-R1 14B, and PHI 4 14B - small, open language models. We explore workflows with one, two, and four agents across 150 instances of 5 common test smell types extracted from real-world Java projects.
- Score: 4.5274260758457645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test smells can compromise the reliability of test suites and hinder software maintenance. Although several strategies exist for detecting test smells, few address their removal. Traditional methods often rely on static analysis or machine learning, requiring significant effort and expertise. This study evaluates LLAMA 3.2 3B, GEMMA 2 9B, DEEPSEEK-R1 14B, and PHI 4 14B - small, open language models - for automating the detection and refactoring of test smells through agent-based workflows. We explore workflows with one, two, and four agents across 150 instances of 5 common test smell types extracted from real-world Java projects. Unlike prior approaches, ours is easily extensible to new smells via natural language definitions and generalizes to Python and Golang. All models detected nearly all test smell instances (pass@5 of 96% with four agents), with PHI 4 14B achieving the highest refactoring accuracy (pass@5 of 75.3%). Analyses were computationally inexpensive and ran efficiently on consumer-grade hardware. Notably, PHI 4 14B with four agents performed within 5% of proprietary models such as O1-MINI, O3-MINI-HIGH, and GEMINI 2.5 PRO EXPERIMENTAL using a single agent. Multi-agent setups outperformed single-agent ones in three out of five test smell types, highlighting their potential to improve software quality with minimal developer effort. For the Assertion Roulette smell, however, a single agent performed better. To assess practical relevance, we submitted 10 pull requests with PHI 4 14B-generated code to open-source projects. Five were merged, one was rejected, and four remain under review, demonstrating the approach's real-world applicability.
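The abstract describes agentic workflows that pair smell detection with refactoring and define each smell in natural language. The following is a minimal sketch of such a two-agent loop, assuming a small open model (e.g., PHI 4 14B) served behind an OpenAI-compatible chat endpoint; the endpoint URL, model tag, prompts, and helper names (SMELL_DEFINITIONS, detect_smells, refactor_test) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a two-agent test-smell workflow; endpoint, model tag, and prompts are assumptions.
import json
import requests

ENDPOINT = "http://localhost:11434/v1/chat/completions"  # assumed local OpenAI-compatible server
MODEL = "phi4:14b"                                        # assumed model tag

# Natural-language smell definitions; new smells can be added as plain text.
SMELL_DEFINITIONS = {
    "Assertion Roulette": (
        "A test method contains several assertions without explanatory "
        "messages, so a failure is hard to trace to a specific check."
    ),
    "Magic Number Test": (
        "A test compares results against unexplained numeric literals."
    ),
}


def chat(system: str, user: str) -> str:
    """Send one system+user exchange to the assumed chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.2,
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def detect_smells(test_code: str) -> list[str]:
    """Detector agent: name the smells it believes are present, or []."""
    definitions = "\n".join(f"- {name}: {desc}" for name, desc in SMELL_DEFINITIONS.items())
    system = (
        "You detect test smells in Java test methods.\n"
        f"Definitions:\n{definitions}\n"
        "Reply with a JSON list of smell names, or [] if none apply."
    )
    reply = chat(system, test_code)
    try:
        parsed = json.loads(reply)
        return parsed if isinstance(parsed, list) else []
    except json.JSONDecodeError:
        return []


def refactor_test(test_code: str, smells: list[str]) -> str:
    """Refactoring agent: rewrite the test to remove the detected smells."""
    system = (
        "You refactor Java test methods to remove the named test smells "
        "while preserving behavior. Reply with only the refactored code."
    )
    user = f"Smells to remove: {', '.join(smells)}\n\n{test_code}"
    return chat(system, user)


if __name__ == "__main__":
    test = """
    @Test
    public void testAccount() {
        assertEquals(100, account.balance());
        assertTrue(account.isActive());
    }
    """
    found = detect_smells(test)
    if found:
        print(refactor_test(test, found))
```

A larger configuration along the lines of the paper's four-agent setup could chain further roles after refactor_test, for example a verifier that re-runs the test suite and a reviewer that checks the diff before a pull request is opened.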
Related papers
- Quality Assessment of Python Tests Generated by Large Language Models [1.0845500038686533]
This study investigates the quality of Python test code generated by three Large Language Models: GPT-4o, Amazon Q, and LLama 3.3. We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C).
arXiv Detail & Related papers (2025-06-17T08:16:15Z) - Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study [6.373038973241454]
Test smells indicate poor development practices in test code, reducing maintainability and reliability. We evaluated GPT-4 Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test suites.
arXiv Detail & Related papers (2025-06-09T09:46:41Z) - Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [55.044159987218436]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior for LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z) - Distilling LLM Agent into Small Models with Retrieval and Code Tools [57.61747522001781]
Agent Distillation is a framework for transferring reasoning capability and task-solving behavior from large language models into small language models. Our results show that sLMs as small as 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, and 7B models.
arXiv Detail & Related papers (2025-05-23T08:20:15Z) - Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - Iterative Trajectory Exploration for Multimodal Agents [69.32855772335624]
We propose an online self-exploration method for multimodal agents, namely SPORT.
SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning.
Evaluation on the GTA and GAIA benchmarks shows that the SPORT Agent achieves improvements of 6.41% and 3.64%, respectively.
arXiv Detail & Related papers (2025-04-30T12:01:27Z) - R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [32.06393076572057]
R2E-Gym is the largest procedurally-curated executable gym environment for training real-world SWE-agents. It is powered by two main contributions: SYNGEN, a synthetic data curation recipe, and Hybrid Test-time Scaling. Overall, our approach achieves 51% on the SWE-Bench Verified benchmark, reflecting a new state-of-the-art for open-weight SWE-agents.
arXiv Detail & Related papers (2025-04-09T17:55:19Z) - LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents [24.234475859016396]
LogiAgent is a novel approach for logical testing of REST systems. It incorporates logical oracles that assess responses based on business logic. It excels in detecting server crashes and achieves superior test coverage compared to four state-of-the-art REST API testing tools.
arXiv Detail & Related papers (2025-03-19T10:24:16Z) - MALT: Improving Reasoning with Multi-Agent LLM Training [66.9481561915524]
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with relative improvements of 15.66%, 7.42%, and 9.40%, respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z) - A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z) - ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery [23.773528748933934]
We present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery.
We extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them.
Using ScienceAgentBench, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug.
arXiv Detail & Related papers (2024-10-07T14:33:50Z) - On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents. The impact of clumsy or even malicious agents -- those that frequently make errors in their tasks -- on the overall performance of the system remains underexplored. This paper investigates the resilience of various system structures under faulty agents on different downstream tasks.
arXiv Detail & Related papers (2024-08-02T03:25:20Z) - Evaluating Large Language Models in Detecting Test Smells [1.5691664836504473]
The presence of test smells can negatively impact the maintainability and reliability of software.
This study aims to evaluate the capability of Large Language Models (LLMs) in automatically detecting test smells.
arXiv Detail & Related papers (2024-07-27T14:00:05Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z) - Theoretically Achieving Continuous Representation of Oriented Bounding Boxes [64.15627958879053]
This paper endeavors to completely solve the issue of discontinuity in Oriented Bounding Box representation.
We propose a novel representation method called Continuous OBB (COBB) which can be readily integrated into existing detectors.
For fairness and transparency of experiments, we have developed a modularized benchmark based on the open-source deep learning framework Jittor's detection toolbox JDet for oriented object detection (OOD) evaluation.
arXiv Detail & Related papers (2024-02-29T09:27:40Z) - Towards Reliable AI Model Deployments: Multiple Input Mixup for
Out-of-Distribution Detection [4.985768723667418]
We propose a novel and simple method to solve the Out-of-Distribution (OOD) detection problem.
Our method can help improve OOD detection performance with only a single epoch of fine-tuning. It does not require training the model from scratch and can simply be attached to the classifier.
arXiv Detail & Related papers (2023-12-24T15:31:51Z) - MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z) - Machine Learning-Based Test Smell Detection [17.957877801382413]
Test smells are symptoms of sub-optimal design choices adopted when developing test cases.
We propose the design and experimentation of a novel test smell detection approach based on machine learning to detect four test smells.
arXiv Detail & Related papers (2022-08-16T07:33:15Z) - Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors is compromised by numerous conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.