Causal Inference-Based Root Cause Analysis for Online Service Systems
with Intervention Recognition
- URL: http://arxiv.org/abs/2206.05871v1
- Date: Mon, 13 Jun 2022 01:45:13 GMT
- Title: Causal Inference-Based Root Cause Analysis for Online Service Systems
with Intervention Recognition
- Authors: Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin
Sui, Dan Pei
- Abstract summary: In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition.
We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA).
The performance on a real-world dataset shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
- Score: 11.067832313491449
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fault diagnosis is critical in many domains, as faults may lead to safety
threats or economic losses. In the field of online service systems, operators
rely on enormous amounts of monitoring data to detect and mitigate failures. Quickly
recognizing a small set of root cause indicators for the underlying fault can
save much time for failure mitigation. In this paper, we formulate the root
cause analysis problem as a new causal inference task named intervention
recognition. We propose a novel unsupervised causal inference-based method
named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a
sufficient condition for a monitoring variable to be a root cause indicator,
i.e., the change of probability distribution conditioned on the parents in the
Causal Bayesian Network (CBN). Towards the application in online service
systems, CIRCA constructs a graph among monitoring metrics based on the
knowledge of system architecture and a set of causal assumptions. The
simulation study illustrates the theoretical reliability of CIRCA. The
performance on a real-world dataset further shows that CIRCA can improve the
recall of the top-1 recommendation by 25% over the best baseline method.
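
The core idea in the abstract (a variable is a root cause indicator if its distribution conditioned on its parents in the CBN changes) can be illustrated with a minimal sketch. The listing does not describe how CIRCA actually tests for this change, so the linear regression on parents and the z-score on residuals below are stand-in assumptions, as are the function name `rank_root_cause_indicators`, the toy graph `a -> b -> c`, and the synthetic data.

```python
import numpy as np


def rank_root_cause_indicators(normal, faulty, parents):
    """Rank variables by how much P(variable | parents) appears to change.

    normal, faulty: dict mapping variable name -> 1-D array of samples
    parents: dict mapping variable name -> list of parent names (the graph)
    Assumption: a linear model fitted on fault-free data approximates the
    conditional distribution; this is an illustration, not CIRCA itself.
    """
    scores = {}
    for name, pa in parents.items():
        y_n = np.asarray(normal[name], dtype=float)
        y_f = np.asarray(faulty[name], dtype=float)
        # Design matrices: intercept column plus one column per parent.
        X_n = np.column_stack(
            [np.ones(len(y_n))] + [np.asarray(normal[p], dtype=float) for p in pa]
        )
        X_f = np.column_stack(
            [np.ones(len(y_f))] + [np.asarray(faulty[p], dtype=float) for p in pa]
        )
        # Fit the conditional model on fault-free data only.
        coef, *_ = np.linalg.lstsq(X_n, y_n, rcond=None)
        resid_n = y_n - X_n @ coef
        resid_f = y_f - X_f @ coef
        # Score: largest standardized residual observed during the fault.
        sigma = resid_n.std() + 1e-9
        scores[name] = float(np.max(np.abs(resid_f - resid_n.mean()) / sigma))
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data (hypothetical): chain a -> b -> c, with the fault injected at b.
rng = np.random.default_rng(0)
graph = {"a": [], "b": ["a"], "c": ["b"]}

a_n = rng.normal(size=500)
b_n = 2.0 * a_n + rng.normal(scale=0.1, size=500)
c_n = -b_n + rng.normal(scale=0.1, size=500)

a_f = rng.normal(size=50)
b_f = 2.0 * a_f + 5.0 + rng.normal(scale=0.1, size=50)  # intervention on b
c_f = -b_f + rng.normal(scale=0.1, size=50)  # c shifts, but only through b

ranking, scores = rank_root_cause_indicators(
    {"a": a_n, "b": b_n, "c": c_n},
    {"a": a_f, "b": b_f, "c": c_f},
    graph,
)
print(ranking, scores)
```

In this toy setup both `b` and `c` deviate from their fault-free marginal behaviour, but only `b` deviates once its parent is conditioned on, which is the distinction the abstract draws.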
Related papers
- Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems.
Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems.
We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
- PORCA: Root Cause Analysis with Partially Observed Data [15.007249208547885]
Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems.
Previous studies implicitly assume full observation of the system, neglecting the effect of partial observation.
We propose PORCA, a novel RCA framework which can explore reliable root causes under both unobserved confounders and unobserved heterogeneity.
arXiv Detail & Related papers (2024-07-08T12:31:12Z)
- KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks [14.336830860792707]
KGroot uses event knowledge and the correlation between events to perform root cause reasoning.
Experiments demonstrate that KGroot can locate the root cause among the top 3 potential causes with 93.5% accuracy, at second-level latency.
arXiv Detail & Related papers (2024-02-11T10:30:38Z)
- On the Fly Detection of Root Causes from Observed Data with Application to IT Systems [3.3321350585823826]
This paper introduces a new structural causal model tailored for representing threshold-based IT systems.
It presents a new algorithm designed to rapidly detect root causes of anomalies in such systems.
arXiv Detail & Related papers (2024-02-09T16:10:19Z)
- Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations.
Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
arXiv Detail & Related papers (2024-02-07T21:58:40Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Causal Disentanglement Hidden Markov Model for Fault Diagnosis [55.90917958154425]
We propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism.
Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors.
To expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments.
arXiv Detail & Related papers (2023-08-06T05:58:45Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- An Influence-based Approach for Root Cause Alarm Discovery in Telecom Networks [7.438302177990416]
In practice, accurate and self-adjustable alarm root cause analysis is a great challenge due to network complexity and vast amounts of alarms.
We propose a data-driven framework for root cause alarm localization, combining both causal inference and network embedding techniques.
We evaluate our method on artificial data and real-world telecom data, showing a significant improvement over the best baselines.
arXiv Detail & Related papers (2021-05-07T07:41:46Z)
- Causal Inference Q-Network: Toward Resilient Reinforcement Learning [57.96312207429202]
We consider a resilient DRL framework with observational interferences.
Under this framework, we propose a causal inference-based DRL algorithm called causal inference Q-network (CIQ).
Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences.
arXiv Detail & Related papers (2021-02-18T23:50:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.