Fugu-MT Paper Translation (Abstract): Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

Paper overview: Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

arxiv url: http://arxiv.org/abs/2511.04064v1
Date: Thu, 06 Nov 2025 05:10:04 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.306762
Title: Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development
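Title (reference translation): Benchmarking and Examining LLM Agent Systems in End-to-End Software Development
Authors: Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, Shikun Zhang
Abstract summary: The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents.
Reference score (independently computed measure of attention): 33.01897134024342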
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges, including overly simplistic benchmarks and the difficulty of conducting fair comparisons between different agent architectures due to confounding implementation variables. To address these limitations, we first construct a challenging and dynamically curated E2EDevBench to simulate realistic development scenarios. Second, we propose a hybrid evaluation framework that combines test-case-based functional assessment with fine-grained, LLM-based requirement verification. Using this framework, we conduct a controlled empirical study on three representative agent architectures implemented upon a unified foundation to isolate the impact of workflow design. Our findings reveal that state-of-the-art agents can fulfill approximately 50% of requirements on E2EDevBench, but their success is critically dependent on the architectural strategy for task decomposition and collaboration. Furthermore, our analysis indicates that the primary bottleneck is the omission of requirements and inadequate self-verification. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents, guiding future research toward enhancing requirement comprehension and planning.
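Abstract (reference translation): The development of LLM-based autonomous agents for end-to-end software development is an important paradigm shift in software engineering. However, the scientific evaluation of these systems is hindered by serious challenges, such as overly simplified benchmarks and the difficulty of fairly comparing different agent architectures because of confounding implementation variables. To address these limitations, we first build E2EDevBench, a challenging and dynamically curated benchmark, to simulate realistic development scenarios. Second, we propose a hybrid evaluation framework that combines test-case-based functional assessment with fine-grained, LLM-based requirement verification. Using this framework, we carry out a controlled empirical study of three representative agent architectures implemented on a unified foundation, so that the effect of workflow design can be isolated. The results show that state-of-the-art agents can satisfy about 50% of the requirements in E2EDevBench, but that their success depends heavily on the architectural strategy for task decomposition and collaboration. The analysis further suggests that the main bottlenecks are omitted requirements and insufficient self-verification. This work gives the community a more realistic benchmark, a comprehensive evaluation framework, and important insights into the current capabilities and core challenges of software development agents, and it guides future research toward strengthening requirement comprehension and planning.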
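
The hybrid evaluation described in the abstract pairs a test-case-based functional score with a per-requirement verification score. The abstract does not give the paper's actual metrics, so the following is only a minimal sketch under assumptions: the names `HybridScore` and `hybrid_score` are hypothetical, both scores are taken to be simple pass fractions, and the requirement verdicts are supplied as booleans that, in practice, would come from an LLM-based verifier.

```python
from dataclasses import dataclass


@dataclass
class HybridScore:
    """Hypothetical container for the two views of one agent run."""
    test_pass_rate: float      # fraction of functional test cases that passed
    requirement_rate: float    # fraction of requirements judged as fulfilled


def _fraction_true(flags: list[bool]) -> float:
    # An empty list yields 0.0 so that a run with no evidence scores nothing.
    return sum(flags) / len(flags) if flags else 0.0


def hybrid_score(test_results: list[bool],
                 requirement_verdicts: list[bool]) -> HybridScore:
    """Report both rates side by side (assumption: they are not merged
    into a single number, since the abstract does not say how or whether
    the two assessments are combined)."""
    return HybridScore(
        test_pass_rate=_fraction_true(test_results),
        requirement_rate=_fraction_true(requirement_verdicts),
    )


if __name__ == "__main__":
    # Illustrative inputs only, not data from the paper.
    score = hybrid_score([True, True, False, True], [True, False])
    print(score)
```

Keeping the two rates separate in this sketch makes it visible when a project passes its tests yet omits requirements, which is the kind of gap the abstract identifies as a primary bottleneck.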

Related papers