Fugu-MT Paper Translation (Abstract): Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

Paper abstract: Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

arxiv url: http://arxiv.org/abs/2606.08590v1
Date: Sun, 07 Jun 2026 12:05:09 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:06.283293
Title: Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents
Title (reference translation): Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents
Authors: Anastasiia Kuvshinova, Seungmin Jin
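Abstract summary: We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system.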
Reference score (independently computed attention score): 1.116726665785374
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Kubernetes incidents are diagnosed reliably only when a root-cause system's reported gains come from incident evidence rather than scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so we report it as benchmark-coupled rather than broad cross-cluster RCA evidence. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark claims as supported, pending, or out of scope. We scope the work to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability did not stay stable enough for controlled scoring, so we make no production-readiness or mean-time-to-repair claim.
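Abstract (reference translation): Kubernetes incidents are diagnosed reliably only when the gains reported by a root-cause system come from incident evidence rather than from scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so it is reported as benchmark-coupled rather than as broad cross-cluster RCA evidence. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark claims as supported, pending, or out of scope. The work is scoped to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability were not stable enough for controlled scoring, so no production-readiness or mean-time-to-repair claim is made.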
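The abstract reports root-cause-entity F1 without spelling out how it is computed. As a rough illustration only, the sketch below computes a set-based F1 between predicted and ground-truth root-cause entities and averages it over scenarios. The function names, the entity strings, and the per-scenario averaging are all assumptions made here for illustration; the paper's scores come from a fixed qwen-plus judge, whose matching rules are not described in the abstract and may differ.

```python
def entity_f1(predicted, expected):
    """Set-based F1 for one scenario (hypothetical scoring rule, not the paper's judge)."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Precision: share of predicted entities that are true root causes.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true root causes that were predicted.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1(scenarios):
    """Average per-scenario F1 over (predicted, expected) pairs (assumed aggregation)."""
    scores = [entity_f1(predicted, expected) for predicted, expected in scenarios]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Entity names below are made up for the example.
    scenarios = [
        (["networkchaos/delay"], ["networkchaos/delay"]),  # exact match
        (["pod/frontend"], ["networkchaos/delay"]),        # miss
    ]
    print(mean_f1(scenarios))
```

Under this assumed rule, a scenario contributes 1.0 only when the predicted entity set matches the ground truth exactly, so extra or missing entities lower the score rather than being ignored.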

Related papers