Fugu-MT Paper Translation (Abstract): Evaluating Software Process Models for Multi-Agent Class-Level Code Generation

Paper overview: Evaluating Software Process Models for Multi-Agent Class-Level Code Generation

arxiv url: http://arxiv.org/abs/2511.09794v1
Date: Fri, 14 Nov 2025 01:10:06 GMT
Status: Translation complete
Last updated in system: 2025-11-14 22:53:22.491895
Title: Evaluating Software Process Models for Multi-Agent Class-Level Code Generation
Authors: Wasique Islam Shafin, Md Nakhla Rafi, Zhenhao Li, Tse-Hsun Chen
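Abstract summary: Large Language Models (LLMs) are increasingly used to automate software development. This work examines how process structure and role specialization shape multi-agent workflows for class-level code generation.
Reference score (independently computed attention score): 5.545076518491288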
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern software systems require code that is not only functional but also maintainable and well-structured. Although Large Language Models (LLMs) are increasingly used to automate software development, most studies focus on isolated, single-agent function-level generation. This work examines how process structure and role specialization shape multi-agent LLM workflows for class-level code generation. We simulate a Waterfall-style development cycle covering Requirement, Design, Implementation, and Testing using three LLMs (GPT-4o-mini, DeepSeek-Chat, and Claude-3.5-Haiku) on 100 Python tasks from the ClassEval benchmark. Our findings show that multi-agent workflows reorganize, rather than consistently enhance, model performance. Waterfall-style collaboration produces cleaner and more maintainable code but often reduces functional correctness (-37.8% for GPT-4o-mini and -39.8% for DeepSeek-Chat), with Claude-3.5-Haiku as a notable exception (+9.5%). Importantly, process constraints shift failure characteristics: structural issues such as missing code decrease, while semantic and validation errors become more frequent. Among all stages, Testing exerts the strongest influence by improving verification coverage but also introducing new reasoning failures, whereas Requirement and Design have comparatively modest effects. Overall, this study provides empirical evidence that software process structure fundamentally alters how LLMs reason, collaborate, and fail, revealing inherent trade-offs between rigid workflow discipline and flexible problem-solving in multi-agent code generation.
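Illustrative sketch (not from the paper): the abstract describes a Waterfall-style cycle in which work passes through Requirement, Design, Implementation, and Testing in order. The Python below is a minimal sketch of that hand-off pattern, assuming each stage is a single agent that receives the task and the outputs of all earlier stages. The names `run_waterfall`, `make_stub_agent`, and the example task are hypothetical, and the stub agents stand in for real LLM calls; the authors' actual prompts, roles, and orchestration are not given here.

```python
from typing import Callable, Dict, List

# Stage names come from the abstract; everything else is a hypothetical sketch.
STAGES: List[str] = ["Requirement", "Design", "Implementation", "Testing"]

# An agent takes the task text and the artifacts of earlier stages,
# and returns its own artifact as text.
Agent = Callable[[str, Dict[str, str]], str]


def run_waterfall(task: str, agents: Dict[str, Agent]) -> Dict[str, str]:
    """Run each stage once, in order, passing earlier artifacts forward."""
    artifacts: Dict[str, str] = {}
    for stage in STAGES:
        # Each agent sees the task plus a copy of all earlier outputs.
        artifacts[stage] = agents[stage](task, dict(artifacts))
    return artifacts


def make_stub_agent(stage: str) -> Agent:
    """Stand-in for an LLM call; it only records which inputs it saw."""
    def agent(task: str, earlier: Dict[str, str]) -> str:
        seen = ", ".join(earlier) or "nothing"
        return f"{stage} output for '{task}' (saw: {seen})"
    return agent


agents = {stage: make_stub_agent(stage) for stage in STAGES}
result = run_waterfall("implement a BankAccount class", agents)
for stage, text in result.items():
    print(f"{stage}: {text}")
```

Dropping a stage from `STAGES` gives a rough analogue of measuring one stage's influence, although how the study isolates per-stage effects is not described in this abstract.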


Related papers