RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic
- URL: http://arxiv.org/abs/2501.11545v1
- Date: Mon, 20 Jan 2025 15:36:39 GMT
- Title: RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic
- Authors: Andrea Tonon, Meng Zhang, Bora Caglayan, Fei Shen, Tong Gui, MingXue Wang, Rong Zhou
- Abstract summary: Root cause analysis is one of the most crucial operations in software reliability regarding system performance diagnostic.
We present a novel causal domain knowledge model representing causal relations about the underlying system components.
We then introduce RADICE, an algorithm that through the causal graph discovery, enhancement, refinement, and subtraction processes is able to output a root cause causal sub-graph.
- Score: 3.708415881042821
- Abstract: Root cause analysis is one of the most crucial operations in software reliability regarding system performance diagnostic. It aims to identify the root causes of system performance anomalies, allowing the resolution or the future prevention of issues that can cause millions of dollars in losses. Common existing approaches relying on data correlation or full domain expert knowledge are inaccurate or infeasible in most industrial cases, since correlation does not imply causation, and domain experts may not have full knowledge of complex and real-time systems. In this work, we define a novel causal domain knowledge model representing causal relations about the underlying system components to allow domain experts to contribute partial domain knowledge for root cause analysis. We then introduce RADICE, an algorithm that through the causal graph discovery, enhancement, refinement, and subtraction processes is able to output a root cause causal sub-graph showing the causal relations between the system components affected by the anomaly. We evaluated RADICE with simulated data and reported a real data use case, sharing the lessons we learned. The experiments show that RADICE provides better results than other baseline methods, including causal discovery algorithms and correlation based approaches for root cause analysis.
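As a rough illustration of what a "root cause causal sub-graph" could look like, the sketch below restricts a small causal graph to the components affected by an anomaly and reports the affected nodes that have no affected parent. It is not the RADICE algorithm: the graph, the component names, and the function `root_cause_subgraph` are all hypothetical, and the discovery, enhancement, refinement, and subtraction steps named in the abstract are not modeled here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical causal graph: each key is a cause, each value lists its direct effects.
CausalGraph = Dict[str, List[str]]


def root_cause_subgraph(
    graph: CausalGraph, anomalous: Set[str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Keep only edges between anomalous components and list candidate roots.

    A candidate root is an anomalous component with no anomalous parent.
    This is an illustrative simplification, not the method from the paper.
    """
    edges = sorted(
        (cause, effect)
        for cause, effects in graph.items()
        if cause in anomalous
        for effect in effects
        if effect in anomalous
    )
    has_anomalous_parent = {effect for _, effect in edges}
    roots = sorted(anomalous - has_anomalous_parent)
    return edges, roots


# Made-up example system; the names are assumptions for illustration only.
graph: CausalGraph = {
    "disk_io": ["db_latency"],
    "db_latency": ["api_latency"],
    "cpu_load": ["api_latency"],
    "api_latency": ["error_rate"],
}
anomalous = {"disk_io", "db_latency", "api_latency", "error_rate"}

edges, roots = root_cause_subgraph(graph, anomalous)
print("sub-graph edges:", edges)
print("candidate root causes:", roots)
```

In this toy system `cpu_load` is not anomalous, so its edge is dropped, and `disk_io` is the only affected component with no affected parent.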
Related papers
- Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems.
Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems.
We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
- Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? [11.627235799040388]
We conduct a comprehensive evaluation of causal inference-based root cause analysis methods for microservice systems.
No method stands out in all situations; each tends to fall short in effectiveness or efficiency, or to be sensitive to specific parameters.
arXiv Detail & Related papers (2024-08-25T05:53:42Z)
- PORCA: Root Cause Analysis with Partially Observed Data [15.007249208547885]
Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems.
Previous studies implicitly assume a full observation of the system, which neglects the effect of partial observation.
We propose PORCA, a novel RCA framework which can explore reliable root causes under both unobserved confounders and unobserved heterogeneity.
arXiv Detail & Related papers (2024-07-08T12:31:12Z)
- On the Fly Detection of Root Causes from Observed Data with Application to IT Systems [3.3321350585823826]
This paper introduces a new structural causal model tailored for representing threshold-based IT systems.
It presents a new algorithm designed to rapidly detect root causes of anomalies in such systems.
arXiv Detail & Related papers (2024-02-09T16:10:19Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- PyRCA: A Library for Metric-based Root Cause Analysis [66.72542200701807]
PyRCA is an open-source machine learning library of Root Cause Analysis (RCA) for Artificial Intelligence for IT Operations (AIOps).
It provides a holistic framework to uncover the complicated metric causal dependencies and automatically locate root causes of incidents.
arXiv Detail & Related papers (2023-06-20T09:55:10Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- DOMINO: Visual Causal Reasoning with Time-Dependent Phenomena [59.291745595756346]
We propose a set of visual analytics methods that allow humans to participate in the discovery of causal relations associated with windows of time delay.
Specifically, we leverage a well-established method, logic-based causality, to enable analysts to test the significance of potential causes.
Since an effect can be a cause of other effects, we allow users to aggregate different temporal cause-effect relations found with our method into a visual flow diagram.
arXiv Detail & Related papers (2023-03-12T03:40:21Z)
- Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization [52.72490784720227]
REASON consists of Topological Causal Discovery and Individual Causal Discovery.
The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes.
The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity.
arXiv Detail & Related papers (2023-02-03T20:17:45Z)
- Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition [11.067832313491449]
In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition.
We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA).
The performance on a real-world dataset shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
arXiv Detail & Related papers (2022-06-13T01:45:13Z)