Related papers: BALANCE: Bayesian Linear Attribution for Root Cause Localization

URL: http://arxiv.org/abs/2301.13572v1
Date: Tue, 31 Jan 2023 11:49:26 GMT
Title: BALANCE: Bayesian Linear Attribution for Root Cause Localization
Authors: Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang, Wenhui Shi
Abstract summary: Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations. This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. We propose BALANCE, which formulates the problem of RCA through the lens of attribution in XAI.
Score: 19.30952654225615
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a forward manner while promoting sparsity and concurrently paying attention to the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthetic dataset as well as three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of accuracy with the least amount of running time, and achieves at least 6% higher accuracy than SOTA methods on the real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and the online results further support its use for real-time diagnosis in distributed data systems.
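The three-step flow described in the abstract (forward sparse prediction, backward attribution, merging across KPIs) can be illustrated with a small sketch. This is not the paper's method: a plain coordinate-descent Lasso stands in for the Bayesian BMFS model, the attribution score is assumed to be weight times deviation from a baseline, and the merge rule (sum of normalized absolute scores) is also an assumption. All function names and the synthetic data are hypothetical.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the Lasso proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_sparse_linear(X, y, penalty=0.05, sweeps=200):
    """Forward step: coordinate-descent Lasso, a stand-in for BMFS (assumption)."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    residual = list(y)  # equals y - X @ weights, with weights starting at zero
    col_sq = [sum(X[r][j] ** 2 for r in range(n)) / n for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[r][j] * (residual[r] + weights[j] * X[r][j])
                      for r in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - weights[j]
            if delta != 0.0:
                for r in range(n):
                    residual[r] -= delta * X[r][j]
                weights[j] = new_w
    return weights


def attribution_scores(weights, anomalous, baseline):
    """Backward step: linear attribution, weight times deviation (assumption)."""
    return [w * (a - b) for w, a, b in zip(weights, anomalous, baseline)]


def merge_kpis(per_kpi_scores):
    """Merge step: sum each candidate's normalized absolute score (assumption)."""
    merged = [0.0] * len(per_kpi_scores[0])
    for scores in per_kpi_scores:
        total = sum(abs(s) for s in scores) or 1.0
        for j, s in enumerate(scores):
            merged[j] += abs(s) / total
    return merged


# Synthetic demo: candidate 1 drives both KPIs, candidate 3 weakly drives one.
random.seed(0)
n, d = 200, 5
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
kpi_latency = [3.0 * row[1] + random.gauss(0, 0.1) for row in X]
kpi_errors = [2.0 * row[1] + 0.5 * row[3] + random.gauss(0, 0.1) for row in X]

baseline = [0.0] * d                     # normal operating point
anomalous = [0.1, 4.0, -0.2, 0.3, 0.0]   # candidate 1 deviates strongly

per_kpi = []
for y in (kpi_latency, kpi_errors):
    w = fit_sparse_linear(X, y)
    per_kpi.append(attribution_scores(w, anomalous, baseline))

merged = merge_kpis(per_kpi)
ranking = sorted(range(d), key=lambda j: merged[j], reverse=True)
print("candidates ranked by merged attribution:", ranking)
```

In this sketch the L1 penalty plays the role of the sparsity prior; unlike BMFS as described in the abstract, it does nothing to account for correlation between candidate root causes.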

Related papers

RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic [3.708415881042821]
Root cause analysis is one of the most crucial operations in software reliability for system performance diagnostics. We present a novel causal domain knowledge model representing causal relations about the underlying system components. We then introduce RADICE, an algorithm that outputs a root cause causal sub-graph through causal graph discovery, enhancement, refinement, and subtraction.
arXiv Detail & Related papers (2025-01-20T15:36:39Z)
PORCA: Root Cause Analysis with Partially Observed Data [15.007249208547885]
Root Cause Analysis (RCA) aims to identify the underlying causes of system faults by uncovering and analyzing the causal structure of complex systems. Previous studies implicitly assume full observation of the system, neglecting the effect of partial observation. We propose PORCA, a novel RCA framework which can explore reliable root causes under both unobserved confounders and unobserved heterogeneity.
arXiv Detail & Related papers (2024-07-08T12:31:12Z)
KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks [14.336830860792707]
KGroot uses event knowledge and the correlation between events to perform root cause reasoning. Experiments demonstrate that KGroot can locate the root cause among the top 3 potential causes with 93.5% accuracy, at second-level latency.
arXiv Detail & Related papers (2024-02-11T10:30:38Z)
Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of the Q-network during training. For the first time, our theory can reliably decide at an early stage whether the training will diverge.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
Causal Disentanglement Hidden Markov Model for Fault Diagnosis [55.90917958154425]
We propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism. Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors. To expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments.
arXiv Detail & Related papers (2023-08-06T05:58:45Z)
Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data. Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process. We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization [52.72490784720227]
REASON consists of Topological Causal Discovery and Individual Causal Discovery. The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes. The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity.
arXiv Detail & Related papers (2023-02-03T20:17:45Z)
Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition [11.067832313491449]
In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The performance on a real-world dataset shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
arXiv Detail & Related papers (2022-06-13T01:45:13Z)
Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies [58.88325379746632]
We present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies as edges to improve the identification and localization of anomalies. Given a series of metrics, our method predicts the most likely system state, either normal or an anomaly class, and performs localization when an anomaly is detected. The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus, which incorporates information about system component dependencies.
arXiv Detail & Related papers (2021-03-09T06:34:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.