Mining Root Cause Knowledge from Cloud Service Incident Investigations
for AIOps
- URL: http://arxiv.org/abs/2204.11598v1
- Date: Thu, 21 Apr 2022 02:33:34 GMT
- Title: Mining Root Cause Knowledge from Cloud Service Incident Investigations
for AIOps
- Authors: Amrita Saha, Steven C.H. Hoi
- Abstract summary: Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
- Score: 71.12026848664753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Root Cause Analysis (RCA) of any service-disrupting incident is one of the
most critical as well as complex tasks in IT processes, especially for cloud
industry leaders like Salesforce. Typically, RCA investigation leverages
data sources like application error logs or service call traces. However, a rich
goldmine of root cause information is also hidden in the natural language
documentation of past incident investigations by domain experts. This is
generally termed Problem Review Board (PRB) data, which constitutes a core
component of IT Incident Management. However, owing to the raw unstructured
nature of PRBs, such root cause knowledge is not directly reusable by manual or
automated pipelines for RCA of new incidents. This motivates us to leverage
this widely-available data-source to build an Incident Causation Analysis (ICA)
engine, using SoTA neural NLP techniques to extract targeted information and
construct a structured Causal Knowledge Graph from PRB documents. ICA forms the
backbone of a simple-yet-effective Retrieval based RCA for new incidents,
through an Information Retrieval system to search and rank past incidents and
detect likely root causes from them, given the incident symptom. In this work,
we present ICA and the downstream Incident Search and Retrieval based RCA
pipeline, built at Salesforce, over 2K documented cloud service incident
investigations collected over a few years. We also establish the effectiveness
of ICA and the downstream tasks through various quantitative benchmarks,
qualitative analysis, domain expert validation, and real incident
case studies after deployment.
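The retrieval step described in the abstract (search and rank past incidents by symptom, then surface their documented root causes) can be illustrated with a minimal sketch. This is not the ICA system: the paper uses neural NLP models and a Causal Knowledge Graph, whereas the sketch below swaps in plain TF-IDF cosine similarity for simplicity. The incident records, the function names, and the `top_k` parameter are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical past-incident records: (symptom text, documented root cause).
# Made-up examples for illustration, not data from the paper.
PAST_INCIDENTS = [
    ("database connection pool exhausted causing request timeouts",
     "connection leak in app server"),
    ("login page returns errors after certificate expired",
     "expired TLS certificate"),
    ("high latency on search service after index rebuild",
     "unoptimized index rebuild job"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency over the symptom texts."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector; tokens unseen in the corpus are dropped."""
    tf = Counter(tokenize(text))
    return {t: f * idf[t] for t, f in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def rank_root_causes(symptom, incidents, top_k=2):
    """Rank past incidents by similarity to the new symptom and
    return (score, root_cause) pairs for the top_k matches."""
    idf = build_idf([s for s, _ in incidents])
    query = vectorize(symptom, idf)
    scored = [(cosine(query, vectorize(s, idf)), cause)
              for s, cause in incidents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for score, cause in rank_root_causes(
            "request timeouts from exhausted connection pool",
            PAST_INCIDENTS):
        print(f"{score:.3f}  {cause}")
```

A lexical ranker like this only matches incidents that share vocabulary with the query, which is one reason a production system would likely prefer learned text representations.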
Related papers
- Exploring LLM-based Agents for Root Cause Analysis [17.053079105858497]
Root cause analysis (RCA) is a critical part of the incident management process.
Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information.
We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
arXiv Detail & Related papers (2024-03-07T00:44:01Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Root Cause Analysis In Microservice Using Neural Granger Causal Discovery [12.35924469567586]
We propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning.
RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery.
In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes.
arXiv Detail & Related papers (2024-02-02T04:43:06Z)
- ESRO: Experience Assisted Service Reliability against Outages [2.647000585570866]
We build a diagnostic service called ESRO that recommends root causes and remediation for failures.
We evaluate our model on several cloud service outages of a large enterprise over the course of 2 years.
arXiv Detail & Related papers (2023-09-13T18:04:52Z)
- PyRCA: A Library for Metric-based Root Cause Analysis [66.72542200701807]
PyRCA is an open-source machine learning library of Root Cause Analysis (RCA) for Artificial Intelligence for IT Operations (AIOps).
It provides a holistic framework to uncover the complicated metric causal dependencies and automatically locate root causes of incidents.
arXiv Detail & Related papers (2023-06-20T09:55:10Z)
- Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [51.94361026233668]
We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative.
We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
arXiv Detail & Related papers (2023-05-25T06:44:50Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- A Pipeline for Business Intelligence and Data-Driven Root Cause Analysis on Categorical Data [0.0]
This paper proposes a new clustering + association rule mining pipeline for getting business insights from data.
The occurrence of any event is explained by its antecedents in the generated rules.
arXiv Detail & Related papers (2022-11-12T18:12:10Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)