Fugu-MT paper translation (abstract): Investigating The Smells of LLM Generated Code

Paper overview: Investigating The Smells of LLM Generated Code

arxiv url: http://arxiv.org/abs/2510.03029v1
Date: Fri, 03 Oct 2025 14:09:55 GMT
Status: translation complete
Last updated in system: 2025-10-06 16:35:52.418448
Title: Investigating The Smells of LLM Generated Code
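Title (reference translation): Investigating the Smells of LLM-Generated Code
Authors: Debalina Ghosh Paul, Hong Zhu, Ian Bayley
Abstract summary: Large language models (LLMs) are increasingly used to generate program code. This study proposes a scenario-based method for evaluating the quality of LLM-generated code.
Reference score (attention score computed independently by this site): 2.9232837969697965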
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Context: Large Language Models (LLMs) are increasingly being used to generate program code. Much research has been reported on the functional correctness of generated code, but there is far less on code quality. Objectives: In this study, we propose a scenario-based method of evaluating the quality of LLM-generated code to identify the weakest scenarios, in which the quality of LLM-generated code should be improved. Methods: The method measures code smells, an important indicator of code quality, and compares them with a baseline formed from reference solutions of professionally written code. The test dataset is divided into subsets according to the topics of the code and the complexity of the coding tasks, to represent different scenarios of using LLMs for code generation. We also present an automated test system for this purpose and report experiments with the Java programs generated in response to prompts given to four state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon. Results: We find that LLM-generated code has a higher incidence of code smells than the reference solutions. Falcon performed the least badly, with a smell increase of 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%) and finally Codex (84.97%). The average smell increase across all LLMs was 63.34%, comprising 73.35% for implementation smells and 21.42% for design smells. We also found that the increase in code smells is greater for more complex coding tasks and for more advanced topics, such as those involving object-oriented concepts. Conclusion: In terms of code smells, LLMs' performance across coding task complexities and topics is highly correlated with the quality of human-written code in the corresponding scenarios. However, the quality of LLM-generated code is noticeably poorer than that of human-written code.
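Abstract (reference translation): Context: Large Language Models (LLMs) are increasingly used to generate program code. Many studies have reported on the functional correctness of generated code, but far fewer on code quality. Objectives: This study proposes a scenario-based method for evaluating the quality of LLM-generated code, in order to identify the weakest scenarios, where that quality should be improved. Methods: The method measures code smells, an important indicator of code quality, and compares them with a baseline built from reference solutions of professionally written code. The test dataset is divided into subsets by code topic and by coding-task complexity, to represent different scenarios of using LLMs for code generation. An automated test system for this purpose is also presented, along with experiments on Java programs generated in response to prompts given to four state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon. Results: LLM-generated code shows a higher incidence of code smells than the reference solutions. Falcon had the smallest smell increase at 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%), and Codex (84.97%). The average smell increase across all LLMs was 63.34%, with 73.35% for implementation smells and 21.42% for design smells. The increase in code smells is also greater for more complex coding tasks and for more advanced topics, such as those involving object-oriented concepts. Conclusion: In terms of code smells, LLM performance across coding-task complexities and topics is strongly correlated with the quality of human-written code in the corresponding scenarios. However, the quality of LLM-generated code is noticeably poorer than that of human-written code.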
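The abstract reports a "smell increase" for each LLM relative to a baseline of reference solutions, but it does not give the formula. The sketch below assumes the simplest reading: a relative percentage increase in smell counts over the baseline. The function names `smell_increase` and `rank_by_increase` are hypothetical and are not taken from the paper; the four percentages are the ones quoted in the abstract.

```python
from statistics import mean


def smell_increase(llm_smells: float, baseline_smells: float) -> float:
    """Percentage increase of smells over a baseline (assumed definition).

    The paper's exact formula is not stated in the abstract; this uses
    (llm - baseline) / baseline * 100 as an illustrative assumption.
    """
    if baseline_smells <= 0:
        raise ValueError("baseline smell count must be positive")
    return (llm_smells - baseline_smells) / baseline_smells * 100.0


# Smell increases (%) as reported in the abstract.
REPORTED_INCREASE = {
    "Falcon": 42.28,
    "Gemini Pro": 62.07,
    "ChatGPT": 65.05,
    "Codex": 84.97,
}


def rank_by_increase(increases: dict[str, float]) -> list[str]:
    """Order models from smallest to largest smell increase."""
    return sorted(increases, key=increases.get)


# Unweighted mean of the four reported figures, for comparison only.
unweighted_mean = mean(REPORTED_INCREASE.values())

if __name__ == "__main__":
    print(rank_by_increase(REPORTED_INCREASE))
    print(round(unweighted_mean, 2))
```

The unweighted mean of the four reported figures is about 63.59%, slightly different from the 63.34% average stated in the abstract, so the paper presumably aggregates in some other way (for example over pooled smell counts); the abstract alone does not say which.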

Related papers