Fugu-MT Paper Translation (Abstract): Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection

Paper summary: Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection

arxiv url: http://arxiv.org/abs/2601.09873v1
Date: Wed, 14 Jan 2026 21:08:35 GMT
Status: Translation complete
System update date: 2026-01-16 19:43:18.901011
Title: Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection
Authors: Saymon Souza, Amanda Santana, Eduardo Figueiredo, Igor Muzetti, João Eduardo Montandon, Lionel Briand
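Abstract summary: Code smells are symptoms of code quality problems that may affect software maintainability. This paper evaluates the effectiveness of four large language models (LLMs) for detecting nine code smells across 30 Java projects.
Reference score (independently computed attention score): 0.5249836059995157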
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code smells are symptoms of potential code quality problems that may affect software maintainability, thus increasing development costs and impacting software reliability. Large language models (LLMs) have shown remarkable capabilities for supporting various software engineering activities, but their use for detecting code smells remains underexplored. However, unlike the rigid rules of static analysis tools, LLMs can support flexible and adaptable detection strategies tailored to the unique properties of code smells. This paper evaluates the effectiveness of four LLMs -- DeepSeek-R1, GPT-5 mini, Llama-3.3, and Qwen2.5-Code -- for detecting nine code smells across 30 Java projects. For the empirical evaluation, we created a ground-truth dataset by asking 76 developers to manually inspect 268 code-smell candidates. Our results indicate that LLMs perform strongly for structurally straightforward smells, such as Large Class and Long Method. However, we also observed that different LLMs and tools fare better for distinct code smells. We then propose and evaluate a detection strategy that combines LLMs and static analysis tools. The proposed strategy outperforms LLMs and tools in five out of nine code smells in terms of F1-Score. However, it also generates more false positives for complex smells. Therefore, we conclude that the optimal strategy depends on whether Recall or Precision is the main priority for code smell detection.
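Illustration (not from the paper): the abstract compares detectors by F1-Score and notes that the combined strategy trades extra false positives for its gains. The sketch below shows how Precision, Recall, and F1 are typically computed against a ground-truth set, and how one possible combination rule behaves. All candidate IDs and detector outputs are made up, and the union rule is only an assumption for illustration; the abstract does not say how the authors combine LLMs and static analysis tools.

```python
def prf1(predicted, actual):
    """Return (precision, recall, f1) for a set of flagged candidates."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth: candidates that developers confirmed as smelly.
ground_truth = {"c1", "c2", "c3", "c4"}

# Hypothetical detector outputs (not results from the paper).
llm_flags = {"c1", "c2", "c5"}
tool_flags = {"c2", "c3", "c6"}

# Assumed combination rule: flag a candidate if either detector flags it.
combined = llm_flags | tool_flags

for name, flags in [("LLM", llm_flags), ("tool", tool_flags), ("combined", combined)]:
    precision, recall, f1 = prf1(flags, ground_truth)
    false_positives = len(flags - ground_truth)
    print(f"{name}: P={precision:.2f} R={recall:.2f} F1={f1:.2f} FP={false_positives}")
```

In this toy data the union raises Recall and F1 while lowering Precision and adding a false positive, which is the kind of trade-off the abstract's conclusion describes when choosing between Recall and Precision as the priority.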
