Fugu-MT Paper Translation (Abstract): An Empirical Study on Failures in Automated Issue Solving

Paper overview: An Empirical Study on Failures in Automated Issue Solving

arxiv url: http://arxiv.org/abs/2509.13941v1
Date: Wed, 17 Sep 2025 13:07:52 GMT
Status: Translation complete
Last updated in system: 2025-09-18 18:41:50.847911
Title: An Empirical Study on Failures in Automated Issue Solving
Title (reference translation): An Empirical Study on Failures in Automated Issue Solving
Authors: Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, Li Zhang
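Abstract summary: We analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, on the automated issue-solving tasks of SWE-Bench-Verified. To move from high-level performance metrics to root-cause analysis, we conducted a systematic manual analysis of 150 failed instances. The results reveal distinct failure fingerprints between the two architectural paradigms, with most agentic failures stemming from flawed reasoning and cognitive deadlocks.
Reference score (independently computed attention score): 12.571536148821144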
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic tools show great promise, they still fail on a substantial portion of tasks. Moreover, current evaluations primarily report aggregate issue-solving rates, which obscure the underlying causes of success and failure, making it challenging to diagnose model weaknesses or guide targeted improvements. To bridge this gap, we first analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, on the automated issue-solving tasks of SWE-Bench-Verified under varying task characteristics. Furthermore, to move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. From this analysis, we developed a comprehensive taxonomy of failure modes comprising 3 primary phases, 9 main categories, and 25 fine-grained subcategories. We then systematically analyze the distribution of the identified failure modes; the results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks. Motivated by these insights, we propose a collaborative Expert-Executor framework. It introduces a supervisory Expert agent tasked with providing strategic oversight and course-correction for a primary Executor agent. This architecture is designed to correct flawed reasoning and break the cognitive deadlocks that frequently lead to failure. Experiments show that our framework solves 22.2% of previously intractable issues for a leading single agent. These findings pave the way for building more robust agents through diagnostic evaluation and collaborative design.
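Abstract (reference translation): Automated issue solving aims to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this field. LLM-based agentic tools hold great promise, yet they still fail on a substantial share of tasks. Moreover, current evaluations mainly report aggregate issue-solving rates, which obscure the root causes of success and failure and make it difficult to diagnose model weaknesses or guide targeted improvements. To bridge this gap, we first analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, on the automated issue-solving tasks of SWE-Bench-Verified. Furthermore, to move from high-level performance metrics to root-cause analysis, we systematically and manually analyzed 150 failed instances. From this we developed a comprehensive taxonomy of failure modes consisting of 3 primary phases, 9 main categories, and 25 fine-grained subcategories. We then systematically analyzed the distribution of the identified failure modes; the results reveal distinct failure fingerprints for the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks. Motivated by these findings, we propose a collaborative Expert-Executor framework. It introduces a supervisory Expert agent that provides strategic oversight and course correction for the primary Executor agent. This architecture is designed to correct flawed reasoning and break the cognitive deadlocks that often lead to failure. Experiments show that our framework solves 22.2% of the issues that were previously intractable for a leading single agent. These findings pave the way for building more robust agents through diagnostic evaluation and collaborative design.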

Related papers