Fugu-MT Paper Translation (Abstract): SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

Paper overview: SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

arxiv url: http://arxiv.org/abs/2605.07001v2
Date: Tue, 12 May 2026 16:50:41 GMT
Status: Translation complete
Last updated in system: 2026-05-13 18:21:06.819966
Title: SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
Title (reference translation): SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
Authors: Ion George Dinu, Marian Cristian Mihăescu, Traian Rebedea
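Abstract summary: Architectural code smells erode software maintainability and are costly to repair manually. This paper presents an empirical evaluation of large language model agents on architectural code smell repair. SmellBench is a task orchestration framework that incorporates smell-type-specific optimized prompts.
Reference score (independently computed attention measure): 1.6922906233636834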
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $\kappa = 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.
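Abstract (reference translation): Architectural code smells erode software maintainability and are costly to repair manually; unlike localized bugs, they require cross-module reasoning about design intent, which challenges both developers and automated tools. Large language model agents excel at bug fixing and code-level refactoring, but their ability to repair architectural code smells has not yet been explored. We present the first empirical evaluation of LLM agents on architectural code smell repair. SmellBench is a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, and it comes with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation shows that 63.1% of the detected smells are false positives, and the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $\kappa = 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. The code and data are available at https://doi.org/10.5281/zenodo.19247588.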
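
Note on the agreement statistic: the abstract reports agent-versus-expert agreement on false positive identification as a kappa value but does not say which variant is used. The sketch below assumes it is Cohen's kappa over two label sequences. The function name `cohen_kappa` and the toy `expert` and `agent` labels are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the paper's kappa is Cohen's kappa; the abstract does not say.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels ("FP" = false positive, "TP" = genuine smell).
expert = ["FP", "FP", "TP", "TP", "FP", "TP", "FP", "FP"]
agent = ["FP", "FP", "TP", "FP", "FP", "TP", "FP", "FP"]
kappa = cohen_kappa(expert, agent)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 7 of 8 items (observed agreement 0.875) while chance agreement is 0.5625, so kappa is 5/7, roughly 0.714. Raw agreement alone would overstate the match when one label dominates, which is relevant here because the abstract reports that most detected smells are false positives.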

List of related papers