Fugu-MT Paper Translation (Abstract): Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM

Paper overview: Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM

arxiv url: http://arxiv.org/abs/2506.02490v1
Date: Tue, 03 Jun 2025 06:09:13 GMT
Status: Translation complete
Last updated in system: 2025-06-04 21:47:35.312452
Title: Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM
Title (reference translation): Simplifying Kubernetes Root Cause Analysis with StateGraph and LLM
Authors: Yong Xiang, Charley Peter Chen, Liyi Zeng, Wei Yin, Xin Liu, Hu Li, Wei Xu
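Abstract summary: We introduce SynergyRCA, an innovative tool for root cause analysis. SynergyRCA constructs a StateGraph that captures spatial and temporal relationships. It identifies root causes in about two minutes on average and achieves a precision of approximately 0.90.
Reference score (attention score computed independently by this site): 13.293736787442414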
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Kubernetes, a notably complex and distributed system, utilizes an array of controllers to uphold cluster management logic through state reconciliation. Nevertheless, maintaining state consistency presents significant challenges due to unexpected failures, network disruptions, and asynchronous issues, especially within dynamic cloud environments. These challenges result in operational disruptions and economic losses, underscoring the necessity for robust root cause analysis (RCA) to enhance Kubernetes reliability. The development of large language models (LLMs) presents a promising direction for RCA. However, existing methodologies encounter several obstacles, including the diverse and evolving nature of Kubernetes incidents, the intricate context of incidents, and the polymorphic nature of these incidents. In this paper, we introduce SynergyRCA, an innovative tool that leverages LLMs with retrieval augmentation from graph databases and enhancement with expert prompts. SynergyRCA constructs a StateGraph to capture spatial and temporal relationships and utilizes a MetaGraph to outline entity connections. Upon the occurrence of an incident, an LLM predicts the most pertinent resource, and SynergyRCA queries the MetaGraph and StateGraph to deliver context-specific insights for RCA. We evaluate SynergyRCA using datasets from two production Kubernetes clusters, highlighting its capacity to identify numerous root causes, including novel ones, with high efficiency and precision. SynergyRCA demonstrates the ability to identify root causes in an average time of about two minutes and achieves an impressive precision of approximately 0.90.
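Abstract (reference translation): Kubernetes, a particularly complex distributed system, uses a set of controllers to maintain cluster management logic through state reconciliation. Keeping state consistent is nevertheless difficult because of unexpected failures, network disruptions, and asynchrony issues, especially in dynamic cloud environments. These difficulties cause operational disruptions and economic losses, which underscores the need for robust root cause analysis (RCA) to improve Kubernetes reliability. The development of large language models (LLMs) offers a promising direction for RCA. Existing methods, however, face several obstacles: the diverse and evolving nature of Kubernetes incidents, the intricate context of each incident, and the polymorphic nature of these incidents. This paper introduces SynergyRCA, an innovative tool that leverages LLMs through retrieval augmentation from graph databases and enhancement with expert prompts. SynergyRCA constructs a StateGraph to capture spatial and temporal relationships and uses a MetaGraph to outline entity connections. When an incident occurs, an LLM predicts the most relevant resource, and SynergyRCA queries the MetaGraph and StateGraph to provide context-specific insights for RCA. SynergyRCA is evaluated on datasets from two production Kubernetes clusters, highlighting its ability to identify numerous root causes, including novel ones, with high efficiency and precision. It identifies root causes in about two minutes on average and achieves a precision of approximately 0.90.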
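
The retrieval step described in the abstract (start from the resource the LLM predicts, then query the MetaGraph and StateGraph for context) can be pictured with a small sketch. Everything below is an illustrative assumption, not SynergyRCA's actual data model or API: the `META_GRAPH` contents, the `StateGraph` class, the `collect_context` helper, and the integer timestamps are all hypothetical, and the LLM prediction is taken as a given input rather than computed.

```python
from collections import deque

# Hypothetical MetaGraph: which entity kinds may connect to which (kind-level schema).
META_GRAPH = {
    "Pod": ["Node", "PersistentVolumeClaim"],
    "PersistentVolumeClaim": ["PersistentVolume"],
    "Node": [],
    "PersistentVolume": [],
}


class StateGraph:
    """Toy instance-level graph; each edge is valid over a time interval."""

    def __init__(self):
        self._edges = {}  # (kind, name) -> list of (neighbor, start, end)

    def add_edge(self, src, dst, start, end=None):
        # end=None means the relationship is still valid.
        self._edges.setdefault(src, []).append((dst, start, end))

    def neighbors_at(self, entity, t):
        # Keep only the relationships that were valid at time t.
        return [
            dst
            for dst, start, end in self._edges.get(entity, [])
            if start <= t and (end is None or t < end)
        ]


def collect_context(state, meta, start_entity, t):
    """Breadth-first walk from the predicted resource, restricted to
    kind-to-kind connections listed in the MetaGraph and valid at time t."""
    seen = [start_entity]
    queue = deque([start_entity])
    while queue:
        current = queue.popleft()
        allowed_kinds = meta.get(current[0], [])
        for neighbor in state.neighbors_at(current, t):
            if neighbor[0] in allowed_kinds and neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


# Usage: the pod moved from node-a to node-b at t=50; the incident is at t=60.
graph = StateGraph()
graph.add_edge(("Pod", "web-0"), ("Node", "node-a"), start=0, end=50)
graph.add_edge(("Pod", "web-0"), ("Node", "node-b"), start=50)
graph.add_edge(("Pod", "web-0"), ("PersistentVolumeClaim", "data-web-0"), start=0)
graph.add_edge(("PersistentVolumeClaim", "data-web-0"), ("PersistentVolume", "pv-1"), start=0)

# Assume the LLM predicted ("Pod", "web-0") as the most pertinent resource.
context = collect_context(graph, META_GRAPH, ("Pod", "web-0"), t=60)
```

In this toy setup the time filter is what makes the context incident-specific: querying at `t=60` reaches `node-b` but not the earlier `node-a`, while the MetaGraph limits the walk to plausible kind-to-kind connections. The collected entities would then be handed to the LLM as context; that step is outside this sketch.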

