Fugu-MT Paper Translation (Abstract): Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Paper overview: Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

arxiv url: http://arxiv.org/abs/2606.01365v1
Date: Sun, 31 May 2026 17:50:11 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.661983
Title: Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability
Title (reference translation): Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability
Authors: Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang
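Abstract summary: This paper proposes a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework is instantiated in a three-agent question-answering system and evaluated on 165 GAIA validation traces under identical execution caps.
Reference score (independently computed attention measure): 8.036549927091286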
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but usually not the point at which the trajectory stopped making recoverable progress. This paper introduces a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. We instantiate the framework in a three-agent question-answering system and evaluate it on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 level-1 runs, 33/86 level-2 runs, and 12/26 level-3 runs fail to produce a usable final answer. The traces expose different mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output. Mean token use rises from 8,152 tokens at level 1 to 16,389 tokens at level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.
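Abstract (reference translation): Multi-agent large language model (LLM) systems that use tools consume computation through model tokens, tool calls, retries, and code execution before they produce an answer. When a run fails, evaluating the final answer shows where the run ended, but usually not the point at which the trajectory stopped making recoverable progress. This paper proposes a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. The framework is instantiated in a three-agent question-answering system and evaluated on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 level-1 runs, 33/86 level-2 runs, and 12/26 level-3 runs do not produce a usable final answer. The traces reveal the different mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output. Mean token use rises from 8,152 tokens at level 1 to 16,389 tokens at level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.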
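
The per-level failure counts and token figures quoted in the abstract can be turned into rates with a few lines of arithmetic. The following is a minimal sketch for the reader's convenience, not code from the paper: the variable and function names are illustrative assumptions, and only the numbers (22/53, 33/86, 12/26, 8,152 and 16,389 tokens) come from the abstract.

```python
# Failure counts per GAIA level as quoted in the abstract: (failed runs, total runs).
# The dictionary layout and all names here are illustrative assumptions.
failures_by_level = {1: (22, 53), 2: (33, 86), 3: (12, 26)}


def failure_rate(failed: int, total: int) -> float:
    """Fraction of runs that did not produce a usable final answer."""
    return failed / total


# Per-level failure rates.
rates = {level: failure_rate(f, t) for level, (f, t) in failures_by_level.items()}

# Aggregate over all levels.
total_failed = sum(f for f, _ in failures_by_level.values())
total_runs = sum(t for _, t in failures_by_level.values())
overall_rate = failure_rate(total_failed, total_runs)

# Ratio of mean token use at level 3 to level 1, from the abstract's figures.
token_ratio = 16389 / 8152

for level, rate in rates.items():
    print(f"level {level}: {rate:.1%}")
print(f"overall: {total_failed}/{total_runs} = {overall_rate:.1%}")
print(f"level-3 / level-1 mean tokens: {token_ratio:.2f}x")
```

The per-level totals sum to 165, matching the number of traces stated in the abstract, and 67 of those runs fail overall. On these figures the failure rate is not monotonic in difficulty level (level 2 is slightly lower than level 1), while mean token use roughly doubles from level 1 to level 3.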

Related papers