Fugu-MT Paper Translation (Abstract): Agentic Troubleshooting Guide Automation for Incident Management

Paper summary: Agentic Troubleshooting Guide Automation for Incident Management

arxiv url: http://arxiv.org/abs/2510.10074v1
Date: Sat, 11 Oct 2025 07:18:36 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.767742
Title: Agentic Troubleshooting Guide Automation for Incident Management
Title (reference translation): Automating Agentic Troubleshooting Guides for Incident Management
Authors: Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang
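Abstract summary: StepFly is a novel end-to-end agentic framework for troubleshooting guide automation. StepFly achieves a success rate of approximately 94% on GPT-4.1. For parallelizable TSGs, it reduces execution time by 32.9% to 70.4%.
Reference score (independently computed attention score): 46.78600624203546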
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist SREs in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to guarantee correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
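Abstract (reference translation): Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but executing them manually is slow and error-prone. Recent advances in LLMs show promise for automating incident management tasks, yet existing LLM-based solutions lack specialized support for several key challenges: managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study of 92 real-world TSGs and, guided by its findings, present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. The approach has three stages. The first provides a comprehensive guide together with a tool, TSG Mentor, that helps SREs improve TSG quality. The second performs offline preprocessing, using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs). The third executes online using a DAG-guided scheduler-executor framework with a memory system, which guarantees correct workflow and supports parallel execution of independent steps. An empirical evaluation on a collection of real-world TSGs and incidents shows that StepFly achieves a success rate of about 94% on GPT-4.1, outperforming baselines while consuming less time and fewer tokens. It also reduces execution time by 32.9% to 70.4% for parallelizable TSGs.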
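
The abstract's third stage, a DAG-guided scheduler that runs independent steps in parallel and shares results through a memory system, can be illustrated with a small sketch. This is not StepFly's implementation: the function `run_dag`, the `steps` and `deps` arguments, and the plain dictionary standing in for the memory system are all hypothetical names chosen for illustration. The sketch relies only on Python's standard `graphlib` and `concurrent.futures` modules.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(steps, deps, max_workers=4):
    """Run each step once, after all of its prerequisites have finished.

    steps: maps a step name to a callable that takes the shared memory dict.
    deps:  maps a step name to the set of step names it depends on.
    Returns the memory dict holding every step's result.
    """
    memory = {}  # hypothetical stand-in for a memory system
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in steps})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    running = {}  # maps each in-flight future to its step name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose prerequisites are all complete;
            # steps that become ready together run concurrently.
            for name in sorter.get_ready():
                running[pool.submit(steps[name], memory)] = name
            # Block until at least one in-flight step finishes.
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                memory[name] = future.result()  # re-raises a step's exception
                sorter.done(name)  # may unblock dependent steps
    return memory


# Illustrative TSG-like workflow: two independent queries follow one check.
steps = {
    "check_alert": lambda m: "disk_full",
    "query_logs": lambda m: f"logs for {m['check_alert']}",
    "query_metrics": lambda m: f"metrics for {m['check_alert']}",
    "summarize": lambda m: [m["query_logs"], m["query_metrics"]],
}
deps = {
    "query_logs": {"check_alert"},
    "query_metrics": {"check_alert"},
    "summarize": {"query_logs", "query_metrics"},
}
result = run_dag(steps, deps)
```

In this example `query_logs` and `query_metrics` do not depend on each other, so the scheduler submits both as soon as `check_alert` completes. Independent steps of this kind are where a parallel scheduler can save wall-clock time over strictly sequential execution.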


Related papers