Disentangled Causal Graph Learning for Online Unsupervised Root Cause
Analysis
- URL: http://arxiv.org/abs/2305.10638v3
- Date: Fri, 2 Jun 2023 21:08:25 GMT
- Title: Disentangled Causal Graph Learning for Online Unsupervised Root Cause
Analysis
- Authors: Dongjie Wang, Zhengzhang Chen, Yanjie Fu, Yanchi Liu, Haifeng Chen
- Abstract summary: Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
- Score: 49.910053255238566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of root cause analysis (RCA) is to identify the root causes of
system faults/failures by analyzing system monitoring data. Efficient RCA can
greatly accelerate system failure recovery and mitigate system damages or
financial losses. However, previous research has mostly focused on developing
offline RCA algorithms, which often require manual initiation of the RCA
process, a significant amount of time and data to train a robust model, and
retraining from scratch for each new system fault.
In this paper, we propose CORAL, a novel online RCA framework that can
automatically trigger the RCA process and incrementally update the RCA model.
CORAL consists of Trigger Point Detection, Incremental Disentangled Causal
Graph Learning, and Network Propagation-based Root Cause Localization. The
Trigger Point Detection component aims to detect system state transitions
automatically and in near-real-time. To achieve this, we develop an online
trigger point detection approach based on multivariate singular spectrum
analysis and cumulative sum statistics. To efficiently update the RCA model, we
propose an incremental disentangled causal graph learning approach to decouple
the state-invariant and state-dependent information. After that, CORAL applies
a random walk with restarts to the updated causal graph to accurately identify
root causes. The online RCA process terminates when the causal graph and the
generated root cause list converge. Extensive experiments on three real-world
datasets with case studies demonstrate the effectiveness and superiority of the
proposed framework.
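As a rough illustration of two building blocks named in the abstract, the sketch below shows a cumulative sum (CUSUM) trigger and a random walk with restarts over a weighted causal graph. It is not CORAL's implementation. The function names, parameters, and toy graph are hypothetical. The CUSUM part assumes a fixed baseline mean where the paper uses multivariate singular spectrum analysis, and the walk assumes the walker starts at a KPI node and moves from effect to cause.

```python
def cusum_trigger(values, baseline_mean, slack=0.5, threshold=5.0):
    """Return the index of the first sample at which a two-sided CUSUM
    statistic exceeds `threshold`, or None if it never does.

    Assumption: `baseline_mean` is a fixed reference level. The paper
    derives its reference from multivariate singular spectrum analysis.
    """
    pos = neg = 0.0
    for i, x in enumerate(values):
        dev = x - baseline_mean
        pos = max(0.0, pos + dev - slack)   # accumulates upward drift
        neg = max(0.0, neg - dev - slack)   # accumulates downward drift
        if pos > threshold or neg > threshold:
            return i
    return None


def random_walk_with_restart(edges, start, restart_prob=0.3,
                             tol=1e-10, max_iter=1000):
    """Approximate stationary visiting probabilities of a random walk
    with restarts.

    `edges` maps (cause, effect) pairs to non-negative weights.
    Assumption: the walker moves against edge direction (effect to
    cause) and returns to `start` from nodes that have no causes.
    """
    nodes = sorted({n for pair in edges for n in pair} | {start})
    # parents[effect][cause] = accumulated weight of that causal edge
    parents = {n: {} for n in nodes}
    for (cause, effect), weight in edges.items():
        if weight > 0:
            parents[effect][cause] = parents[effect].get(cause, 0.0) + weight

    prob = {n: 0.0 for n in nodes}
    prob[start] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in prob.items():
            if mass == 0.0:
                continue
            total = sum(parents[node].values())
            if total == 0.0:
                # Dead end: send the non-restart mass back to the start.
                nxt[start] += (1.0 - restart_prob) * mass
            else:
                for cause, weight in parents[node].items():
                    nxt[cause] += (1.0 - restart_prob) * mass * weight / total
        nxt[start] += restart_prob  # total mass is 1, so restart adds c
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


if __name__ == "__main__":
    # Toy metric: flat at 0, then a level shift to 3.
    series = [0.0] * 20 + [3.0] * 10
    print("trigger index:", cusum_trigger(series, baseline_mean=0.0))

    # Toy causal graph (hypothetical): db and cache both influence api,
    # and api influences the monitored kpi.
    graph = {("db", "api"): 0.9, ("cache", "api"): 0.1, ("api", "kpi"): 1.0}
    scores = random_walk_with_restart(graph, start="kpi")
    ranking = sorted((n for n in scores if n != "kpi"),
                     key=scores.get, reverse=True)
    print("candidate ranking:", ranking)
```

In this toy setup the visiting probabilities act as root cause scores: nodes reached more often from the KPI rank higher. A real pipeline would rerun the walk each time the causal graph is updated and stop once the ranking stabilizes, which is the termination condition the abstract describes.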
Related papers
- Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems.
Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems.
We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
- KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks [14.336830860792707]
KGroot uses event knowledge and the correlation between events to perform root cause reasoning.
Experiments demonstrate that KGroot can locate the root cause among the top 3 candidate causes with 93.5% accuracy, within seconds.
arXiv Detail & Related papers (2024-02-11T10:30:38Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [51.94361026233668]
We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative.
We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
arXiv Detail & Related papers (2023-05-25T06:44:50Z)
- Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization [52.72490784720227]
REASON consists of Topological Causal Discovery and Individual Causal Discovery.
The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes.
The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity.
arXiv Detail & Related papers (2023-02-03T20:17:45Z)
- Detecting and Ranking Causal Anomalies in End-to-End Complex System [10.02817768857185]
We propose a framework called Ranking Causal Anomalies in End-to-End System (RCAE2E).
arXiv Detail & Related papers (2023-01-18T03:09:28Z)
- Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition [11.067832313491449]
In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition.
We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA).
The performance on a real-world dataset shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
arXiv Detail & Related papers (2022-06-13T01:45:13Z)
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
- Causal Discovery from Sparse Time-Series Data Using Echo State Network [0.0]
Causal discovery between collections of time-series data can help diagnose causes of symptoms and hopefully prevent faults before they occur.
We propose a new system composed of two parts: the first fills in missing data with Gaussian Process Regression, and the second leverages an Echo State Network.
We report the corresponding Matthews Correlation Coefficient (MCC) and Receiver Operating Characteristic (ROC) curves and show that the proposed system outperforms existing algorithms.
arXiv Detail & Related papers (2022-01-09T05:55:47Z)
- An Influence-based Approach for Root Cause Alarm Discovery in Telecom Networks [7.438302177990416]
In practice, accurate and self-adjustable alarm root cause analysis is a great challenge due to network complexity and vast amounts of alarms.
We propose a data-driven framework for root cause alarm localization, combining both causal inference and network embedding techniques.
We evaluate our method on artificial data and real-world telecom data, showing a significant improvement over the best baselines.
arXiv Detail & Related papers (2021-05-07T07:41:46Z)