Fugu-MT Paper Translation (Summary): STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Paper summary: STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

arxiv url: http://arxiv.org/abs/2605.15581v1
Date: Fri, 15 May 2026 03:44:39 GMT
Status: Translation complete
Last updated in system: 2026-05-19 03:45:13.157394
Title: STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Title (reference translation): STAR: A Stage-attributed Triage and Repair Framework for RCA Agents in Microservices
Authors: Junle Wang, Xingchuang Liao, Wenjun Wu
Abstract summary: We propose STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow ...
Reference score (independently computed attention score): 10.602349579405832
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
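Abstract (reference translation): LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. Errors in early evidence collection, hypothesis formulation, or causal analysis propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy and repairs most initially incorrect traces within one or two replay rounds. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.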
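The four-stage decomposition and the patch-and-replay idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than LangGraph, and the names `replay`, `localize_and_repair`, `stage_fns`, `patches`, and `accept` are hypothetical. It assumes that "counterfactual candidate evaluation" means trying a candidate patch at each stage, re-running the downstream stages, and treating the first stage whose patch yields an accepted trace as the decisive one. Auditing and Fast/Slow Routing are omitted.

```python
from typing import Callable, Dict, Optional, Tuple

# Stage names taken from the abstract; everything else is a hypothetical sketch.
STAGES = ["EP", "HS", "AS", "DR"]

Trace = Dict[str, object]            # maps a stage name to that stage's output
StageFn = Callable[[Trace], object]  # computes one stage from the trace so far


def replay(trace: Trace, stage_fns: Dict[str, StageFn], start: str) -> Trace:
    """Re-run every stage strictly after `start`, keeping earlier outputs."""
    new_trace = dict(trace)
    for name in STAGES[STAGES.index(start) + 1:]:
        new_trace[name] = stage_fns[name](new_trace)
    return new_trace


def localize_and_repair(
    trace: Trace,
    stage_fns: Dict[str, StageFn],
    patches: Dict[str, StageFn],
    accept: Callable[[Trace], bool],
) -> Optional[Tuple[str, Trace]]:
    """Patch one stage at a time, replay downstream, and return the first
    (stage, repaired trace) pair that the acceptance check passes."""
    for stage in STAGES:
        if stage not in patches:
            continue
        candidate = dict(trace)
        candidate[stage] = patches[stage](trace)  # counterfactual stage output
        candidate = replay(candidate, stage_fns, stage)
        if accept(candidate):
            return stage, candidate
    return None


# Toy workflow: the hypothesis stage ignores anomaly scores (the "bug").
stage_fns = {
    "HS": lambda t: sorted(t["EP"]),              # alphabetical, not by score
    "AS": lambda t: t["HS"][0],                   # pick the top hypothesis
    "DR": lambda t: {"root_cause": t["AS"]},
}
faulty = {"EP": {"db": 0.9, "cache": 0.1}}
for name in ["HS", "AS", "DR"]:
    faulty[name] = stage_fns[name](faulty)

patches = {
    "EP": lambda t: t["EP"],                                        # no-op re-collection
    "HS": lambda t: sorted(t["EP"], key=t["EP"].get, reverse=True), # rank by score
}
# Stand-in for an auditor: the reported root cause must carry the top anomaly score.
accept = lambda t: t["EP"][t["DR"]["root_cause"]] == max(t["EP"].values())

result = localize_and_repair(faulty, stage_fns, patches, accept)
```

In this toy run the EP patch changes nothing downstream, so the trace is still rejected; the HS patch flips the final report, so HS is returned as the decisive stage together with the repaired trace.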


Related papers