Fugu-MT Paper Translation (Abstract): DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Paper overview: DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

arxiv url: http://arxiv.org/abs/2512.06749v2
Date: Tue, 09 Dec 2025 13:22:36 GMT
Status: Translation complete
Last updated in system: 2025-12-10 14:12:23.005093
Title: DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
Title (reference translation): DoVer: Intervention-Driven Automatic Debugging for LLM Multi-Agent Systems
Authors: Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Tianming Yang, Saravan Rajmohan, Dongmei Zhang
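Abstract summary: DoVer is an intervention-driven debugging framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
Reference score (independently computed attention metric): 48.971606069204825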
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magentic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.
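Abstract (reference translation): Large language model (LLM)-based multi-agent systems are difficult to debug because failures arise from long, branching interaction traces. The common practice is to use LLMs for log-based failure localization, attributing an error to a specific agent and step. This paradigm, however, has two important limitations: (i) log-only debugging lacks validation and produces untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, because multiple distinct interventions can independently repair the failed task. To address the first limitation, the authors introduce DoVer, an intervention-driven debugging framework that strengthens hypothesis generation with active verification through targeted interventions (for example, editing messages or altering plans). For the second limitation, instead of evaluating attribution accuracy, they measure whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magentic-One agent framework, on datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also works effectively with a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving the reliability of agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. The project website and code will be available at https://aka.ms/DoVer.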
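
Illustrative sketch (not from the paper): the abstract describes testing a failure hypothesis by intervening on a failed trace and re-running, then judging the outcome rather than the attribution. The toy Python below shows that loop under strong assumptions. The names `Hypothesis`, `intervene_and_rerun`, `toy_replay`, and `toy_succeeded` are hypothetical, the "agent system" is a fixed list of arithmetic messages, and nothing here reflects DoVer's actual code or the Magentic-One or AG2 APIs.

```python
from dataclasses import dataclass
from typing import Callable, List

Trace = List[str]  # one message per step, e.g. "add 3"


@dataclass
class Hypothesis:
    """Hypothetical failure hypothesis: the message at `step` is at fault."""
    step: int          # index of the suspected step in the failed trace
    replacement: str   # intervention: edited message to substitute there


def intervene_and_rerun(
    trace: Trace,
    hyp: Hypothesis,
    replay: Callable[[Trace], Trace],
    succeeded: Callable[[Trace], bool],
) -> bool:
    """Return True if the intervention flips the failed trial into a success."""
    # Keep the history before the suspected step, then apply the edit.
    prefix = trace[: hyp.step] + [hyp.replacement]
    # Re-run the system from the edited prefix and judge the outcome.
    return succeeded(replay(prefix))


# Toy stand-in for a multi-agent run: the task is to reach a total of 10.
FAILED_TRACE = ["add 2", "add 30", "add 5"]
TARGET = 10


def toy_replay(prefix: Trace) -> Trace:
    # Assumption: downstream steps repeat unchanged after the edit.
    return prefix + FAILED_TRACE[len(prefix):]


def toy_succeeded(trace: Trace) -> bool:
    return sum(int(msg.split()[1]) for msg in trace) == TARGET


hypotheses = [
    Hypothesis(step=1, replacement="add 3"),    # fix the suspicious step
    Hypothesis(step=0, replacement="add 1"),    # edit an unrelated step
    Hypothesis(step=2, replacement="add -22"),  # compensate downstream
]
results = [
    intervene_and_rerun(FAILED_TRACE, h, toy_replay, toy_succeeded)
    for h in hypotheses
]
flip_rate = sum(results) / len(results)
print(results, flip_rate)
```

In this toy run, two different interventions (at step 1 and at step 2) each repair the task on their own, which mirrors the abstract's point that single-step attribution can be ill-posed. A real system would replay with live agents, so downstream steps could change after the edit.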

Related papers