Automatic Root Cause Analysis via Large Language Models for Cloud
Incidents
- URL: http://arxiv.org/abs/2305.15778v4
- Date: Mon, 13 Nov 2023 05:05:31 GMT
- Title: Automatic Root Cause Analysis via Large Language Models for Cloud
Incidents
- Authors: Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi,
Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo Ghosh, Xuchao
Zhang, Chaoyun Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Tianyin
Xu
- Abstract summary: We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative.
We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
- Score: 51.94361026233668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring the reliability and availability of cloud services necessitates
efficient root cause analysis (RCA) for cloud incidents. Traditional RCA
methods, which rely on manual investigations of data sources such as logs and
traces, are often laborious, error-prone, and challenging for on-call
engineers. In this paper, we introduce RCACopilot, an innovative on-call system
empowered by a large language model for automating RCA of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based
on their alert types, aggregates the critical runtime diagnostic information,
predicts the incident's root cause category, and provides an explanatory
narrative. We evaluate RCACopilot using a real-world dataset consisting of a
year's worth of incidents from Microsoft. Our evaluation demonstrates that
RCACopilot achieves RCA accuracy up to 0.766. Furthermore, the diagnostic
information collection component of RCACopilot has been successfully in use at
Microsoft for over four years.
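The flow the abstract describes (match an incident to a handler by alert type, aggregate diagnostics, predict a root cause category, produce a short explanation) can be illustrated with a small Python sketch. This is not RCACopilot's implementation: the handler names, alert types, historical incidents, and category labels are all hypothetical, and a simple token-overlap lookup stands in for the large language model so that the example runs without any external service.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical handlers: each one gathers diagnostic text for one alert type.
Handler = Callable[[dict], List[str]]


def disk_handler(incident: dict) -> List[str]:
    return [f"disk usage on {incident['host']}: 97%", "io errors in system log"]


def auth_handler(incident: dict) -> List[str]:
    return [f"token validation failures on {incident['host']}", "certificate expired"]


# Step 1: incidents are routed to a handler by alert type (names are made up).
HANDLERS: Dict[str, Handler] = {
    "DiskFull": disk_handler,
    "AuthFailure": auth_handler,
}

# Hypothetical labeled history: (diagnostic summary, root cause category).
HISTORY: List[Tuple[str, str]] = [
    ("disk usage high io errors log volume full", "StorageExhaustion"),
    ("certificate expired token validation failures", "ExpiredCertificate"),
]


def tokens(text: str) -> Counter:
    """Lowercase bag of words; colons are treated as separators."""
    return Counter(text.lower().replace(":", " ").split())


def overlap(a: Counter, b: Counter) -> int:
    """Number of shared tokens, counting multiplicity."""
    return sum((a & b).values())


def diagnose(incident: dict) -> dict:
    handler = HANDLERS.get(incident["alert_type"])
    if handler is None:
        raise KeyError(f"no handler for alert type {incident['alert_type']!r}")

    # Step 2: aggregate the diagnostic information collected by the handler.
    diagnostics = handler(incident)
    summary = tokens(" ".join(diagnostics))

    # Step 3: stand-in for the LLM prediction, using the most similar past incident.
    past_text, category = max(
        HISTORY, key=lambda item: overlap(summary, tokens(item[0]))
    )

    # Step 4: a templated explanation in place of a generated narrative.
    narrative = (
        f"Predicted {category}: the diagnostics resemble a past incident "
        f"described as '{past_text}'."
    )
    return {"category": category, "diagnostics": diagnostics, "narrative": narrative}


if __name__ == "__main__":
    result = diagnose({"alert_type": "DiskFull", "host": "node-1"})
    print(result["category"])
    print(result["narrative"])
```

In a real system the similarity lookup and templated narrative would be replaced by a model call; the sketch only shows how the four stages connect.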
Related papers
- Exploring LLM-based Agents for Root Cause Analysis [17.053079105858497]
Root cause analysis (RCA) is a critical part of the incident management process.
Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information.
We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
arXiv Detail & Related papers (2024-03-07T00:44:01Z)
- TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems [44.53009495726297]
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems.
This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems.
arXiv Detail & Related papers (2023-10-28T15:49:00Z)
- RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models [46.476439550746136]
Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently.
We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage.
Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools.
arXiv Detail & Related papers (2023-10-25T03:53:31Z)
- PyRCA: A Library for Metric-based Root Cause Analysis [66.72542200701807]
PyRCA is an open-source machine learning library of Root Cause Analysis (RCA) for Artificial Intelligence for IT Operations (AIOps).
It provides a holistic framework to uncover the complicated metric causal dependencies and automatically locate root causes of incidents.
arXiv Detail & Related papers (2023-06-20T09:55:10Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
- DAE : Discriminatory Auto-Encoder for multivariate time-series anomaly detection in air transportation [68.8204255655161]
We propose a novel anomaly detection model called Discriminatory Auto-Encoder (DAE).
It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase.
Results show that the DAE achieves better results in both accuracy and speed of detection.
arXiv Detail & Related papers (2021-09-08T14:07:55Z)