Fugu-MT Paper Translation (Abstract): Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

Paper overview: Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

arxiv url: http://arxiv.org/abs/2601.22208v1
Date: Thu, 29 Jan 2026 18:23:26 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.001758
Title: Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis
Authors: Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, Krzysztof Czarnecki
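Abstract summary: We present a focused empirical evaluation that isolates an LLM's reasoning behavior. We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation.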
Reference score (independently computed attention score): 5.532586951580959
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts, particularly for multi-hop fault propagation, where symptoms appear far from their true causes. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance automated RCA. However, their practical value for RCA depends on the fidelity of reasoning and decision-making. Existing work relies on historical incident corpora, operates directly on high-volume telemetry beyond current LLM capacity, or embeds reasoning inside complex multi-agent pipelines -- conditions that obscure whether failures arise from reasoning itself or from peripheral design choices. We present a focused empirical evaluation that isolates an LLM's reasoning behavior. We design a controlled experimental framework that foregrounds the LLM by using a simplified experimental setting. We evaluate six LLMs under two agentic workflows (ReAct and Plan-and-Execute) and a non-agentic baseline on two real-world case studies (GAIA and OpenRCA). In total, we executed 48,000 simulated failure scenarios, totaling 228 days of execution time. We measure both root-cause accuracy and the quality of intermediate reasoning traces. We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation. Our results clarify where current open-source LLMs succeed and fail in multi-hop RCA, quantify sensitivity to input data modalities, and identify reasoning failures that predict final correctness. Together, these contributions provide transparent and reproducible empirical results and a failure taxonomy to guide future work on reasoning-driven system diagnosis.

Related papers