Fugu-MT Paper Translation (Abstract): CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

Paper overview: CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

arxiv url: http://arxiv.org/abs/2604.23455v2
Date: Fri, 01 May 2026 07:18:53 GMT
Status: Translation complete
Last updated in system: 2026-05-04 13:37:10.820981
Title: CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
Authors: Haoming Meng
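Abstract summary: We present CUJBench, to our knowledge the first benchmark to combine browser-visible failure evidence with backend observability in a diagnostic framing. The benchmark yields an overall accuracy of 19.7% with a ceiling of 52%, well below saturation.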
Reference score (independently computed attention measure): 2.9612444540570113
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Automated failure diagnosis requires correlating browser-visible symptoms with backend observability signals, yet existing benchmarks do not evaluate this cross-modal reasoning task. Constructing one is non-trivial: multi-modal failure scenarios are costly to annotate, and live-environment capture introduces stochasticity that makes cross-run agent comparison unreliable. We present CUJBench, to our knowledge, the first benchmark to combine browser-visible failure evidence with backend observability in a diagnostic framing. CUJBench addresses annotation cost through an LLM-assisted generation pipeline with a multi-agent review loop and a three-layer annotation scheme, producing 87 labeled scenarios across five fault families, and ensures reproducibility by packaging each failure as a deterministic multi-modal snapshot with a fixed tool interface. Evaluating six frontier models under retrieval, browser-only, and full-toolset baselines, the benchmark yields an overall accuracy of 19.7% with a ceiling of 52%, well below saturation. Contrary to expectation, browser-only agents outperform full-toolset agents in aggregate, with expanded evidence access inducing unfocused exploration rather than improved synthesis. Trajectory analysis identifies cross-modal synthesis as the primary bottleneck: agents retrieve the decisive evidence but fail to attribute it correctly - a structural limitation uniform across all six models that model scale and richer tool access alone cannot resolve.

Related papers