Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection
- URL: http://arxiv.org/abs/2601.09873v1
- Date: Wed, 14 Jan 2026 21:08:35 GMT
- Title: Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection
- Authors: Saymon Souza, Amanda Santana, Eduardo Figueiredo, Igor Muzetti, João Eduardo Montandon, Lionel Briand
- Abstract summary: Code smells are symptoms of potential code quality problems that may affect software maintainability. This paper evaluates the effectiveness of four large language models (LLMs) for detecting nine code smells across 30 Java projects.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code smells are symptoms of potential code quality problems that may affect software maintainability, thus increasing development costs and impacting software reliability. Large language models (LLMs) have shown remarkable capabilities for supporting various software engineering activities, but their use for detecting code smells remains underexplored. Unlike the rigid rules of static analysis tools, LLMs can support flexible and adaptable detection strategies tailored to the unique properties of code smells. This paper evaluates the effectiveness of four LLMs -- DeepSeek-R1, GPT-5 mini, Llama-3.3, and Qwen2.5-Code -- for detecting nine code smells across 30 Java projects. For the empirical evaluation, we created a ground-truth dataset by asking 76 developers to manually inspect 268 code-smell candidates. Our results indicate that LLMs perform strongly for structurally straightforward smells, such as Large Class and Long Method. However, we also observed that different LLMs and tools fare better for distinct code smells. We then propose and evaluate a detection strategy that combines LLMs and static analysis tools. The proposed strategy outperforms both the LLMs and the tools on five of the nine code smells in terms of F1-Score. However, it also generates more false positives for complex smells. Therefore, we conclude that the optimal strategy depends on whether Recall or Precision is the main priority for code smell detection.
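The abstract's Recall-versus-Precision trade-off can be illustrated with a small sketch. This is a hypothetical toy example, not the paper's actual combination strategy: it assumes the combined detector is a simple set union (or intersection) of the smell candidates flagged by an LLM and by a static analysis tool, and all detector outputs and the ground truth below are invented.

```python
# Toy illustration of combining an LLM detector with a static analysis tool
# for one code smell. All sets of candidate IDs here are invented.

def prf1(predicted: set, truth: set):
    """Precision, Recall, and F1-Score of a detector against ground truth."""
    tp = len(predicted & truth)   # flagged and truly smelly
    fp = len(predicted - truth)   # flagged but clean (false positive)
    fn = len(truth - predicted)   # smelly but missed (false negative)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical ground truth: candidates 1-4 are true smells.
truth = {1, 2, 3, 4}
llm_flags = {1, 2, 5}    # invented LLM verdicts
tool_flags = {2, 3, 6}   # invented static-tool verdicts

union = llm_flags | tool_flags          # favors Recall, admits more false positives
intersection = llm_flags & tool_flags   # favors Precision, misses more true smells

for name, pred in [("LLM", llm_flags), ("Tool", tool_flags),
                   ("Union", union), ("Intersection", intersection)]:
    p, r, f = prf1(pred, truth)
    print(f"{name:12s} P={p:.2f} R={r:.2f} F1={f:.2f}")
```

On this toy data the union improves F1 over either detector alone while adding false positives, and the intersection reaches perfect Precision at the cost of Recall, mirroring the trade-off the abstract reports for complex smells.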
Related papers
- Evaluating and Achieving Controllable Code Completion in Code LLM [89.64782747840225]
We present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench). We reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench.
arXiv Detail & Related papers (2026-01-22T11:40:04Z) - Specification and Detection of LLM Code Smells [3.53563608080816]
We introduce the concept of LLM code smells and formalize five problematic coding practices related to LLM inference in software systems. We extend the detection tool SpecDetect4AI to cover the newly defined LLM code smells and use it to validate their prevalence in a dataset of 200 open-source LLM systems.
arXiv Detail & Related papers (2025-12-19T19:24:56Z) - A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code [49.09545217453401]
Propensity Smelly Score (PSC) is a metric that estimates the likelihood of generating particular smell types. We identify how generation strategy, model size, model architecture and prompt formulation shape the structural properties of generated code. PSC helps developers interpret model behavior and assess code quality, providing evidence that smell propensity signals can support human judgement.
arXiv Detail & Related papers (2025-11-19T19:18:28Z) - Investigating The Smells of LLM Generated Code [2.9232837969697965]
Large Language Models (LLMs) are increasingly being used to generate program code. This study proposes a scenario-based method of evaluating the quality of LLM-generated code.
arXiv Detail & Related papers (2025-10-03T14:09:55Z) - Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation [69.8237598448941]
This study investigates the potential of ensemble learning to enhance the performance of Large Language Models (LLMs) in source code vulnerability detection. We propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection.
arXiv Detail & Related papers (2025-09-16T03:48:22Z) - Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset [13.23492570818459]
This study presents the first systematic effort to assess and improve dataset quality in terms of code smells. We propose an LLM-based code smell cleaning tool, named SmellCC, which automatically removes code smells.
arXiv Detail & Related papers (2025-08-16T07:40:58Z) - Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z) - How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [45.126233498200534]
We introduce CodeSmellEval, a benchmark designed to evaluate the propensity of Large Language Models for generating code smells. Our benchmark includes a novel metric, Propensity Smelly Score (PSC), and a curated dataset of method-level code smells, CodeSmellData. To demonstrate the use of CodeSmellEval, we conducted a case study with two state-of-the-art LLMs, CodeLlama and Mistral.
arXiv Detail & Related papers (2024-12-25T21:56:35Z) - A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Code Smell Detection [11.9757082688031]
Code smells are suboptimal coding practices that negatively impact the quality of software systems. Existing detection methods, relying on Codes or Machine Learning (ML) and Deep Learning (DL) techniques, often face limitations such as unsatisfactory performance. This study evaluates state-of-the-art PEFT methods on both Small (SLMs) and Large Language Models (LLMs) for detecting four types of code smells.
arXiv Detail & Related papers (2024-12-18T12:48:36Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.