Related papers: Root Cause Analysis In Microservice Using Neural Granger Causal Discovery

URL: http://arxiv.org/abs/2402.01140v1
Date: Fri, 2 Feb 2024 04:43:06 GMT
Title: Root Cause Analysis In Microservice Using Neural Granger Causal Discovery
Authors: Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, Wen-Chih Peng
Abstract summary: We propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes.
Score: 12.35924469567586
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintainability, and flexibility. However, when a system malfunctions, the complex relationships among microservices make it challenging for site reliability engineers (SREs) to pinpoint the root cause. Previous research employed structured learning methods (e.g., PC-algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, these methods ignored the temporal order of time series data and failed to leverage the rich information inherent in the temporal relationships. For instance, a sudden spike in CPU utilization can lead to an increase in latency for other microservices. In this scenario, the anomaly in CPU utilization occurs before the latency increase, rather than simultaneously, so the PC-algorithm fails to capture such characteristics. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments conducted on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms the state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at https://github.com/zmlin1998/RUN.
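
Illustration: the ranking step named in the abstract (PageRank with a personalization vector) can be sketched as below. This is a minimal sketch under stated assumptions, not the RUN implementation: the metric names, the causal edges, the anomaly scores used as the personalization vector, and the choice to walk from effect to cause are all hypothetical and chosen only for illustration.

```python
def personalized_pagerank(edges, personalization, alpha=0.85, iters=200, tol=1e-12):
    """Power iteration for PageRank with a personalization (teleport) vector.

    edges: dict mapping a node to the list of nodes it points to.
    personalization: dict mapping a node to a non-negative teleport weight.
    """
    nodes = sorted(set(edges)
                   | {m for succ in edges.values() for m in succ}
                   | set(personalization))
    total = sum(personalization.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization weights must sum to a positive value")
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)  # start from the teleport distribution
    for _ in range(iters):
        new = {n: (1 - alpha) * teleport[n] for n in nodes}
        for n in nodes:
            succ = edges.get(n, [])
            if succ:
                share = alpha * rank[n] / len(succ)
                for m in succ:
                    new[m] += share
            else:
                # Dangling node: spread its mass according to the teleport vector.
                for m in nodes:
                    new[m] += alpha * rank[n] * teleport[m]
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical causal graph, written as cause -> effects.
causal_edges = {
    "cpu_cart": ["latency_cart"],
    "latency_cart": ["latency_frontend", "latency_orders"],
}

# Assumption: walk from effect to cause so that score accumulates at causes.
reversed_edges = {}
for cause, effects in causal_edges.items():
    for effect in effects:
        reversed_edges.setdefault(effect, []).append(cause)

# Hypothetical per-metric anomaly scores used as the personalization vector.
anomaly_scores = {
    "cpu_cart": 0.9,
    "latency_cart": 0.6,
    "latency_frontend": 0.4,
    "latency_orders": 0.1,
}

ranks = personalized_pagerank(reversed_edges, anomaly_scores)
top_k = sorted(ranks, key=ranks.get, reverse=True)[:2]
print(top_k)
```

In this sketch the personalization vector biases the random walk's restarts toward metrics that look anomalous, while the reversed edges carry score upstream toward candidate causes; how RUN actually builds the vector and orients the graph is described in the paper and its repository.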

Related papers

Causal AI-based Root Cause Identification: Research to Practice at Scale [2.455633941531165]
We have developed a novel causality-based Root Cause Identification (RCI) algorithm that emphasizes causation over correlation. This paper highlights Instana's advanced failure diagnosis capabilities, discussing both the theoretical underpinnings and practical implementations of the RCI algorithm.
arXiv Detail & Related papers (2025-02-25T14:20:33Z)
Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
CAnDOIT: Causal Discovery with Observational and Interventional Data from Time-Series [4.008958683836471]
CAnDOIT is a causal discovery method to reconstruct causal models using both observational and interventional data. The use of interventional data in the causal analysis is crucial for real-world applications, such as robotics. A Python implementation of CAnDOIT has also been developed and is publicly available on GitHub.
arXiv Detail & Related papers (2024-10-03T13:57:08Z)
Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data. Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process. We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
CUTS+: High-dimensional Causal Discovery from Irregular Time-series [13.84185941100574]
We propose CUTS+, which is built on the Granger-causality-based causal discovery method CUTS. We show that CUTS+ largely improves the causal discovery performance on high-dimensional data with different types of irregular sampling.
arXiv Detail & Related papers (2023-05-10T04:20:36Z)
CUTS: Neural Causal Discovery from Irregular Time-Series Data [27.06531262632836]
Causal discovery from time-series data has been a central task in machine learning. We present CUTS, a neural Granger causal discovery algorithm to jointly impute unobserved data points and build causal graphs. Our approach constitutes a promising step towards applying causal discovery to real applications with non-ideal observations.
arXiv Detail & Related papers (2023-02-15T04:16:34Z)
Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes. In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
Reducing Catastrophic Forgetting in Self Organizing Maps with Internally-Induced Generative Replay [67.50637511633212]
A lifelong learning agent is able to continually learn from potentially infinite streams of pattern sensory data. One major historic difficulty in building agents that adapt is that neural systems struggle to retain previously-acquired knowledge when learning from new samples. This problem is known as catastrophic forgetting (interference) and remains an unsolved problem in the domain of machine learning to this day.
arXiv Detail & Related papers (2021-12-09T07:11:14Z)
An Influence-based Approach for Root Cause Alarm Discovery in Telecom Networks [7.438302177990416]
In practice, accurate and self-adjustable alarm root cause analysis is a great challenge due to network complexity and vast amounts of alarms. We propose a data-driven framework for root cause alarm localization, combining both causal inference and network embedding techniques. We evaluate our method on artificial data and real-world telecom data, showing a significant improvement over the best baselines.
arXiv Detail & Related papers (2021-05-07T07:41:46Z)
Consistency of mechanistic causal discovery in continuous-time using Neural ODEs [85.7910042199734]
We consider causal discovery in continuous-time for the study of dynamical systems. We propose a causal discovery algorithm based on penalized Neural ODEs.
arXiv Detail & Related papers (2021-05-06T08:48:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.