Generic and Robust Root Cause Localization for Multi-Dimensional Data in
Online Service Systems
- URL: http://arxiv.org/abs/2305.03331v1
- Date: Fri, 5 May 2023 07:22:30 GMT
- Title: Generic and Robust Root Cause Localization for Multi-Dimensional Data in
Online Service Systems
- Authors: Zeyan Li, Junjie Chen, Yihao Chen, Chengyang Luo, Yiwei Zhao, Yongqian
Sun, Kaixin Sui, Xiping Wang, Dapeng Liu, Xing Jin, Qi Wang, Dan Pei
- Abstract summary: Localizing root causes for multi-dimensional data is critical to ensure online service systems' reliability.
This paper proposes a generic and robust root cause localization approach for multi-dimensional data, PSqueeze.
Case studies in several production systems demonstrate that PSqueeze is helpful for fault diagnosis in the real world.
- Score: 22.308016571592105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Localizing root causes for multi-dimensional data is critical to ensure
online service systems' reliability. When a fault occurs, only the measure
values within specific attribute combinations are abnormal. Such attribute
combinations are substantial clues to the underlying root causes and thus are
called root causes of multi-dimensional data. This paper proposes a generic and
robust root cause localization approach for multi-dimensional data, PSqueeze.
We propose a generic property of root cause for multi-dimensional data,
generalized ripple effect (GRE). Based on it, we propose a novel probabilistic
cluster method and a robust heuristic search method. Moreover, we identify the
importance of determining external root causes and propose, for the first time
in the literature, an effective method for doing so. Our experiments on two real-world datasets
with 5400 faults show that the F1-score of PSqueeze outperforms baselines by
32.89%, while the localization time is around 10 seconds across all cases. The
F1-score in determining external root causes of PSqueeze achieves 0.90.
Furthermore, case studies in several production systems demonstrate that
PSqueeze is helpful for fault diagnosis in the real world.
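The problem setting in the abstract (measure values that are abnormal only within specific attribute combinations) can be illustrated with a small sketch. This is not PSqueeze: it is a brute-force toy that aggregates forecast and observed values over every attribute combination and reports the relative deviation of each. The records, the attribute names (`province`, `isp`), and the helper names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical leaf-level records: (attributes, forecast value, observed value).
# In this toy data, only records with province=Beijing deviate from the forecast.
records = [
    ({"province": "Beijing", "isp": "Mobile"}, 100.0, 50.0),
    ({"province": "Beijing", "isp": "Unicom"}, 80.0, 40.0),
    ({"province": "Shanghai", "isp": "Mobile"}, 120.0, 119.0),
    ({"province": "Shanghai", "isp": "Unicom"}, 60.0, 61.0),
]


def deviation_by_combination(rows):
    """Relative deviation (forecast - observed) / forecast per attribute combination."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [forecast sum, observed sum]
    for attrs, forecast, observed in rows:
        names = sorted(attrs)
        # Each record contributes to every combination of its attribute values.
        for size in range(1, len(names) + 1):
            for subset in combinations(names, size):
                key = tuple((name, attrs[name]) for name in subset)
                totals[key][0] += forecast
                totals[key][1] += observed
    return {key: (f - o) / f for key, (f, o) in totals.items() if f > 0}


def most_general_top(scores):
    """Pick the highest deviation, preferring combinations with fewer attributes on ties."""
    return max(scores, key=lambda key: (scores[key], -len(key)))


scores = deviation_by_combination(records)
best = most_general_top(scores)
print(best, scores[best])
```

On this toy data the sketch selects the single-attribute combination province=Beijing, since both of its finer-grained combinations share the same relative deviation. Exhaustive enumeration grows exponentially with the number of attributes, which is one reason practical approaches rely on clustering and heuristic search as the abstract describes.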
Related papers
- Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought [11.307072056343662]
We introduce RCLAgent, an adaptive root cause localization method for microservice systems. We show that RCLAgent achieves superior performance by localizing the root cause using only a single request, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-08-28T02:34:19Z) - RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic [3.708415881042821]
Root cause analysis is one of the most crucial operations in software reliability regarding system performance diagnostic.
We present a novel causal domain knowledge model representing causal relations about the underlying system components.
We then introduce RADICE, an algorithm that, through causal graph discovery, enhancement, refinement, and subtraction, outputs a root cause causal sub-graph.
arXiv Detail & Related papers (2025-01-20T15:36:39Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - Root Cause Explanation of Outliers under Noisy Mechanisms [50.59446568076628]
Causal processes are often modelled as graphs with entities being nodes and their paths/interconnections as edges.
Existing work only considers the contribution of nodes in the generative process.
We consider both the individual edges and nodes of each mechanism when identifying the root causes.
arXiv Detail & Related papers (2023-12-19T03:24:26Z) - Hierarchical Graph Neural Networks for Causal Discovery and Root Cause
Localization [52.72490784720227]
REASON consists of Topological Causal Discovery and Individual Causal Discovery.
The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes.
The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity.
arXiv Detail & Related papers (2023-02-03T20:17:45Z) - BALANCE: Bayesian Linear Attribution for Root Cause Localization [19.30952654225615]
Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations.
This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA.
We propose BALANCE, which formulates the problem of RCA through the lens of attribution in XAI.
arXiv Detail & Related papers (2023-01-31T11:49:26Z) - Causality-Based Multivariate Time Series Anomaly Detection [63.799474860969156]
We formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism to generate the multivariate data.
We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism.
We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications.
arXiv Detail & Related papers (2022-06-30T06:00:13Z) - RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk [1.2691047660244335]
Failures and anomalies in large-scale software systems are unavoidable incidents.
Operators need to quickly and correctly identify their location to facilitate a swift repair.
We propose RiskLoc to solve the problem of multidimensional root cause localization.
arXiv Detail & Related papers (2022-05-20T07:43:18Z) - CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis [17.755405467437637]
In large-scale online services, crucial metrics, a.k.a. key performance indicators (KPIs), are monitored periodically to check their running statuses.
Once abnormal values are observed, root cause analysis (RCA) can be applied to identify the reasons for anomalies.
We propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components.
arXiv Detail & Related papers (2022-03-30T13:17:19Z) - An Influence-based Approach for Root Cause Alarm Discovery in Telecom
Networks [7.438302177990416]
In practice, accurate and self-adjustable alarm root cause analysis is a great challenge due to network complexity and the vast number of alarms.
We propose a data-driven framework for root cause alarm localization, combining both causal inference and network embedding techniques.
We evaluate our method on artificial data and real-world telecom data, showing a significant improvement over the best baselines.
arXiv Detail & Related papers (2021-05-07T07:41:46Z) - Learning Dependencies in Distributed Cloud Applications to Identify and
Localize Anomalies [58.88325379746632]
We present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies as edges to improve the identification and localization of anomalies.
Given a series of metrics, our method predicts the most likely system state (either normal or an anomaly class) and performs localization when an anomaly is detected.
The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus which incorporates information about system component dependencies.
arXiv Detail & Related papers (2021-03-09T06:34:05Z) - TadGAN: Time Series Anomaly Detection Using Generative Adversarial
Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.