BALANCE: Bayesian Linear Attribution for Root Cause Localization
- URL: http://arxiv.org/abs/2301.13572v1
- Date: Tue, 31 Jan 2023 11:49:26 GMT
- Title: BALANCE: Bayesian Linear Attribution for Root Cause Localization
- Authors: Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang, Wenhui Shi
- Abstract summary: Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations.
This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA.
We propose BALANCE, which formulates the problem of RCA through the lens of attribution in XAI.
- Score: 19.30952654225615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Root Cause Analysis (RCA) plays an indispensable role in distributed data
system maintenance and operations, as it bridges the gap between fault
detection and system recovery. Existing works mainly study multidimensional
localization or graph-based root cause localization. This paper opens up the
possibilities of exploiting the recently developed framework of explainable AI
(XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian
Linear AttributioN for root CausE localization), which formulates the problem
of RCA through the lens of attribution in XAI and seeks to explain the
anomalies in the target KPIs by the behavior of the candidate root causes.
BALANCE consists of three innovative components. First, we propose a Bayesian
multicollinear feature selection (BMFS) model to predict the target KPIs given
the candidate root causes in a forward manner while promoting sparsity and
concurrently paying attention to the correlation between the candidate root
causes. Second, we introduce attribution analysis to compute the attribution
score for each candidate in a backward manner. Third, we merge the estimated
root causes related to each KPI if there are multiple KPIs. We extensively
evaluate the proposed BALANCE method on one synthetic dataset as well as three
real-world RCA tasks, that is, bad SQL localization, container fault
localization, and fault type diagnosis for Exathlon. Results show that BALANCE
outperforms the state-of-the-art (SOTA) methods in terms of accuracy with the
least amount of running time, and achieves at least $6\%$ higher
accuracy than SOTA methods on the real-world tasks. BALANCE has been deployed to
production to tackle real-world RCA problems, and the online results further
advocate its usage for real-time diagnosis in distributed data systems.
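The three steps in the abstract (forward sparse prediction, backward attribution, merging across KPIs) can be illustrated with a minimal sketch. This is not the authors' BMFS model: a plain lasso fitted by coordinate descent stands in for the Bayesian feature selection, and unlike BMFS it makes no special provision for correlated candidates. The attribution score `w[j] * shift[j]` is the standard attribution for a linear model relative to a baseline. The merge rule (summing normalized absolute scores over KPIs), the function names, the `penalty` value, and the synthetic data are all assumptions made for illustration.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_sparse_linear(X, y, penalty=0.05, sweeps=100):
    """Lasso via coordinate descent on centered data (no intercept).

    Minimizes (1 / 2n) * ||y - Xw||^2 + penalty * ||w||_1.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w

def rank_root_causes(X, kpis, normal_len, penalty=0.05):
    """Rank candidate columns of X by merged attribution over all KPIs."""
    n, d = len(X), len(X[0])
    # Baseline = mean of each candidate over the normal (pre-anomaly) window.
    baseline = [sum(X[i][j] for i in range(normal_len)) / normal_len for j in range(d)]
    Xc = [[X[i][j] - baseline[j] for j in range(d)] for i in range(n)]
    # Mean deviation of each candidate from its baseline during the anomaly.
    shift = [sum(Xc[i][j] for i in range(normal_len, n)) / (n - normal_len)
             for j in range(d)]
    merged = [0.0] * d
    for y in kpis:
        y_base = sum(y[:normal_len]) / normal_len
        yc = [v - y_base for v in y]
        w = fit_sparse_linear(Xc, yc, penalty)          # forward: predict the KPI
        attributions = [w[j] * shift[j] for j in range(d)]  # backward: attribute
        total = sum(abs(a) for a in attributions) or 1.0
        for j in range(d):
            merged[j] += abs(attributions[j]) / total   # assumed merge rule
    ranking = sorted(range(d), key=lambda j: merged[j], reverse=True)
    return ranking, merged

# Hypothetical synthetic data: candidate 0 receives an injected fault.
rng = random.Random(0)
normal_len, anomaly_len, num_candidates = 200, 20, 5
X = []
for t in range(normal_len + anomaly_len):
    row = [rng.gauss(0.0, 1.0) for _ in range(num_candidates)]
    if t >= normal_len:
        row[0] += 5.0
    X.append(row)
kpi_a = [2.0 * r[0] + 0.1 * rng.gauss(0.0, 1.0) for r in X]
kpi_b = [1.5 * r[0] + 0.5 * r[3] + 0.1 * rng.gauss(0.0, 1.0) for r in X]

ranking, merged_scores = rank_root_causes(X, [kpi_a, kpi_b], normal_len)
print(ranking[0], [round(s, 3) for s in merged_scores])
```

Note that candidate 3 does influence the second KPI, but it barely deviates from its baseline during the anomaly window, so its attribution stays small. This is the difference between being a predictor of a KPI and being an explanation of its anomaly.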
Related papers
- PORCA: Root Cause Analysis with Partially Observed Data [15.007249208547885]
Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems.
Previous studies implicitly assume full observation of the system, which neglects the effect of partial observation.
We propose PORCA, a novel RCA framework which can explore reliable root causes under both unobserved confounders and unobserved heterogeneity.
arXiv Detail & Related papers (2024-07-08T12:31:12Z)
- KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks [14.336830860792707]
KGroot uses event knowledge and the correlation between events to perform root cause reasoning.
Experiments demonstrate that KGroot can locate the root cause among the top 3 potential causes with an accuracy of 93.5%, within seconds.
arXiv Detail & Related papers (2024-02-11T10:30:38Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Causal Disentanglement Hidden Markov Model for Fault Diagnosis [55.90917958154425]
We propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism.
Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors.
To expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments.
arXiv Detail & Related papers (2023-08-06T05:58:45Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization [52.72490784720227]
REASON consists of Topological Causal Discovery and Individual Causal Discovery.
The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes.
The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity.
arXiv Detail & Related papers (2023-02-03T20:17:45Z)
- Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition [11.067832313491449]
In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition.
We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA).
The performance on a real-world dataset shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
arXiv Detail & Related papers (2022-06-13T01:45:13Z)
- Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies [58.88325379746632]
We present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies as edges to improve the identification and localization of anomalies.
Given a series of metrics, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected.
The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus which incorporates information about system component dependencies.
arXiv Detail & Related papers (2021-03-09T06:34:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.