ESRO: Experience Assisted Service Reliability against Outages
- URL: http://arxiv.org/abs/2309.07230v1
- Date: Wed, 13 Sep 2023 18:04:52 GMT
- Title: ESRO: Experience Assisted Service Reliability against Outages
- Authors: Sarthak Chakraborty, Shubham Agarwal, Shaddy Garg, Abhimanyu Sethia,
Udit Narayan Pandey, Videh Aggarwal, Shiv Saini
- Abstract summary: We build a diagnostic service called ESRO that recommends root causes and remediation for failures.
We evaluate our model on several cloud service outages of a large enterprise over the course of 2 years.
- Score: 2.647000585570866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern cloud services are prone to failures due to their complex
architecture, making diagnosis a critical process. Site Reliability Engineers
(SREs) spend hours leveraging multiple sources of data, including the alerts,
error logs, and domain expertise through past experiences to locate the root
cause(s). These experiences are documented as natural language text in outage
reports for previous outages. However, utilizing the raw yet rich
semi-structured information in the reports systematically is time-consuming.
Structured information, on the other hand, such as alerts that are often used
during fault diagnosis, is voluminous and requires expert knowledge to discern.
Several strategies have been proposed to use each source of data separately for
root cause analysis. In this work, we build a diagnostic service called ESRO
that recommends root causes and remediation for failures by utilizing
structured as well as semi-structured sources of data systematically. ESRO
constructs a causal graph using alerts and a knowledge graph using outage
reports, and merges them in a novel way to form a unified graph during
training. A retrieval-based mechanism is then used to search the unified graph
and rank the likely root causes and remediation techniques based on the alerts
fired during an outage at inference time. Both the individual alerts and
their respective importance in predicting an outage group are taken into account
during recommendation. We evaluated our model on several cloud service outages
of a large SaaS enterprise over the course of ~2 years, and obtained an average
improvement of 27% in ROUGE scores over state-of-the-art baselines when comparing
the likely root causes against the ground truth. We further establish
the effectiveness of ESRO through qualitative analysis on multiple real outage
examples.
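The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the alert names, outage reports, importance weights, and the function `rank_root_causes` are all hypothetical, and the unified graph is reduced to a plain mapping from alerts to past outages. It only shows the general idea of scoring past outages by the importance-weighted alerts fired at inference time.

```python
from collections import defaultdict

# Hypothetical unified graph, flattened: each alert links to the past
# outages it was associated with during training.
ALERT_TO_OUTAGES = {
    "db_latency_high": ["outage_1", "outage_2"],
    "disk_usage_high": ["outage_2"],
    "http_5xx_spike": ["outage_1", "outage_3"],
}

# Hypothetical root cause and remediation text taken from outage reports.
OUTAGE_REPORTS = {
    "outage_1": {"root_cause": "connection pool exhausted",
                 "remediation": "restart the connection pool"},
    "outage_2": {"root_cause": "disk full on primary database",
                 "remediation": "purge old logs and extend the volume"},
    "outage_3": {"root_cause": "bad deployment of the API gateway",
                 "remediation": "roll back the release"},
}

# Hypothetical per-alert importance weights (assumed to be learned elsewhere).
ALERT_IMPORTANCE = {
    "db_latency_high": 0.5,
    "disk_usage_high": 0.25,
    "http_5xx_spike": 0.125,
}


def rank_root_causes(fired_alerts, top_k=2):
    """Rank past outages by the summed importance of the alerts that fired."""
    scores = defaultdict(float)
    for alert in fired_alerts:
        weight = ALERT_IMPORTANCE.get(alert, 0.0)
        for outage in ALERT_TO_OUTAGES.get(alert, []):
            scores[outage] += weight
    # Highest score first; ties are broken by outage id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (OUTAGE_REPORTS[o]["root_cause"], OUTAGE_REPORTS[o]["remediation"], s)
        for o, s in ranked[:top_k]
    ]


if __name__ == "__main__":
    for cause, fix, score in rank_root_causes(["db_latency_high", "disk_usage_high"]):
        print(f"{score:.2f}  {cause}  ->  {fix}")
```

In this toy setting an outage linked to several important alerts outranks one linked to a single alert, which mirrors the abstract's point that alert importance, and not only alert presence, drives the recommendation.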
Related papers
- LogRCA: Log-based Root Cause Analysis for Distributed Services [4.049637286678329]
We propose LogRCA, a novel method for identifying a minimal set of log lines that together describe a root cause.
LogRCA uses a semi-supervised learning approach to deal with rare and unknown errors and is designed to handle noisy data.
We evaluated our approach on a large-scale production log data set of 44.3 million log lines, which contains 80 failures, whose root causes were labeled by experts.
arXiv Detail & Related papers (2024-05-22T12:50:56Z)
- Exploring LLM-based Agents for Root Cause Analysis [17.053079105858497]
Root cause analysis (RCA) is a critical part of the incident management process.
Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information.
We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
arXiv Detail & Related papers (2024-03-07T00:44:01Z)
- KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks [14.336830860792707]
KGroot uses event knowledge and the correlation between events to perform root cause reasoning.
Experiments demonstrate that KGroot can locate the root cause with an accuracy of 93.5% among the top 3 potential causes, at second-level latency.
arXiv Detail & Related papers (2024-02-11T10:30:38Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
- A2Log: Attentive Augmented Log Anomaly Detection [53.06341151551106]
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services.
Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary.
We develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: anomaly scoring and anomaly decision.
arXiv Detail & Related papers (2021-09-20T13:40:21Z)
- An Influence-based Approach for Root Cause Alarm Discovery in Telecom Networks [7.438302177990416]
In practice, accurate and self-adjustable alarm root cause analysis is a great challenge due to network complexity and vast amounts of alarms.
We propose a data-driven framework for root cause alarm localization, combining both causal inference and network embedding techniques.
We evaluate our method on artificial data and real-world telecom data, showing a significant improvement over the best baselines.
arXiv Detail & Related papers (2021-05-07T07:41:46Z)
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, as a major troubleshooting source of system information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
- TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)