CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis
- URL: http://arxiv.org/abs/2203.16280v1
- Date: Wed, 30 Mar 2022 13:17:19 GMT
- Title: CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis
- Authors: Shifu Yan, Caihua Shan, Wenyi Yang, Bixiong Xu, Dongsheng Li, Lili
Qiu, Jie Tong, Qi Zhang
- Abstract summary: In large-scale online services, crucial metrics, a.k.a., key performance indicators (KPIs), are monitored periodically to check their running statuses.
Once abnormal values are observed, root cause analysis (RCA) can be applied to identify the reasons for anomalies.
We propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components.
- Score: 17.755405467437637
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In large-scale online services, crucial metrics, a.k.a., key performance
indicators (KPIs), are monitored periodically to check their running statuses.
Generally, KPIs are aggregated along multiple dimensions and derived by complex
calculations among fundamental metrics from the raw data. Once abnormal KPI
values are observed, root cause analysis (RCA) can be applied to identify the
reasons for anomalies, so that we can troubleshoot quickly. Recently, several
automatic RCA techniques were proposed to localize the related dimensions (or a
combination of dimensions) to explain the anomalies. However, their analyses
are limited to the data on the abnormal metric and ignore the data of other
metrics which may also be related to the anomalies, leading to imprecise or
even incorrect root causes. To this end, we propose a cross-metric
multi-dimensional root cause analysis method, named CMMD, which consists of two
key components: 1) relationship modeling, which utilizes a graph neural network
(GNN) to model the unknown complex calculation among metrics and aggregation
function among dimensions from historical data; 2) root cause localization,
which adopts the genetic algorithm to efficiently and effectively dive into the
raw data and localize the abnormal dimension(s) once the KPI anomalies are
detected. Experiments on synthetic datasets, public datasets and an online
production environment demonstrate the superiority of our proposed CMMD method
compared with baselines. Currently, CMMD is running as an online service in
Microsoft Azure.
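The root cause localization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions (region, device), the fundamental metrics (requests, errors), the data values, and the function names kpi, fitness and localize are all hypothetical, and a known formula (errors / requests) stands in for the relationship that CMMD learns with a GNN. A small genetic algorithm searches for the set of leaf cells whose values, when reset to their expected values, best restore the KPI.

```python
import random
from itertools import product

REGIONS = ["us", "eu", "asia"]
DEVICES = ["web", "ios", "android"]
LEAVES = list(product(REGIONS, DEVICES))  # finest-grained (region, device) cells

# Hypothetical data: (requests, errors) per leaf, forecast vs. observed.
# The injected anomaly multiplies errors by 8 wherever region == "eu".
EXPECTED = {leaf: (1000.0, 10.0) for leaf in LEAVES}
OBSERVED = {leaf: (1000.0, 80.0 if leaf[0] == "eu" else 10.0) for leaf in LEAVES}


def kpi(data):
    """Derived KPI: overall error rate = sum(errors) / sum(requests).

    Assumption: the formula is known here; CMMD instead learns it from history.
    """
    requests = sum(r for r, _ in data.values())
    errors = sum(e for _, e in data.values())
    return errors / requests


def fitness(mask):
    """Score a candidate root cause (1 = leaf is blamed). Higher is better.

    Blamed leaves are reset to their expected values; the score is the share
    of the KPI deviation this explains, minus a small penalty per blamed leaf
    so that smaller explanations win ties.
    """
    repaired = {leaf: (EXPECTED[leaf] if bit else OBSERVED[leaf])
                for leaf, bit in zip(LEAVES, mask)}
    total_gap = abs(kpi(OBSERVED) - kpi(EXPECTED))
    remaining_gap = abs(kpi(repaired) - kpi(EXPECTED))
    return (1.0 - remaining_gap / total_gap) - 0.01 * sum(mask)


def localize(pop_size=40, generations=80, elite=10, mutation_rate=0.1, seed=0):
    """Genetic search over leaf subsets: elitism, uniform crossover, bit flips."""
    rng = random.Random(seed)
    n = len(LEAVES)
    population = [tuple(rng.randint(0, 1) for _ in range(n))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]  # the best candidates survive unchanged
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            # Take each bit from either parent, then flip it with small probability.
            child = tuple((x if rng.random() < 0.5 else y)
                          ^ (rng.random() < mutation_rate)
                          for x, y in zip(a, b))
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return [leaf for leaf, bit in zip(LEAVES, best) if bit]


if __name__ == "__main__":
    print(localize())
```

On this toy data the search should converge on the three cells with region "eu". The per-leaf penalty in fitness is a design choice of this sketch: without it, blaming every cell would explain the deviation equally well.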
Related papers
- FC-ADL: Efficient Microservice Anomaly Detection and Localisation Through Functional Connectivity [2.994962964425238]
We propose FC-ADL, an end-to-end scalable approach for detecting and localising anomalous changes from microservice metrics.
We demonstrate that our approach can achieve top detection and localisation performance across a wide degree of different fault scenarios when compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-11-30T11:29:30Z)
- Isolation-based Spherical Ensemble Representations for Anomaly Detection [60.989157958972356]
Anomaly detection is a critical task in data mining and management with applications spanning fraud detection, network security, and log monitoring.
Existing unsupervised anomaly detection methods face fundamental challenges including conflicting distributional assumptions, computational inefficiency, and difficulty handling different anomaly types.
We propose ISER (Isolation-based Spherical Ensemble Representations) that extends existing isolation-based methods by using hypersphere radii as proxies for local density characteristics while maintaining linear time and constant space complexity.
arXiv Detail & Related papers (2025-10-15T09:00:05Z)
- Robust Root Cause Diagnosis using In-Distribution Interventions [31.19149413954674]
Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today's cloud services and industrial operations.
We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria.
arXiv Detail & Related papers (2025-05-02T00:19:43Z)
- Enhancing Web Service Anomaly Detection via Fine-grained Multi-modal Association and Frequency Domain Analysis [8.860339665670255]
Anomaly detection is crucial for ensuring the stability and reliability of web service systems.
Existing anomaly detection methods use logs and metrics to detect anomalies.
We propose a novel anomaly detection method named FFAD to address these two issues.
arXiv Detail & Related papers (2025-01-28T12:00:45Z)
- Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems.
Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems.
We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Root Cause Analysis In Microservice Using Neural Granger Causal Discovery [12.35924469567586]
We propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning.
RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery.
In addition, RUN incorporates PageRank with a vector to efficiently recommend the top-k root causes.
arXiv Detail & Related papers (2024-02-02T04:43:06Z)
- Unraveling the "Anomaly" in Time Series Anomaly Detection: A Self-supervised Tri-domain Solution [89.16750999704969]
Anomaly labels hinder traditional supervised models in time series anomaly detection.
Various SOTA deep learning techniques, such as self-supervised learning, have been introduced to tackle this issue.
We propose a novel self-supervised learning based Tri-domain Anomaly Detector (TriAD).
arXiv Detail & Related papers (2023-11-19T05:37:18Z)
- Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services [29.37493773435177]
CMAnomaly is an anomaly detection framework on multivariate monitoring metrics based on collaborative machine.
The proposed framework is extensively evaluated with both public data and industrial data collected from a large-scale online service system of Huawei Cloud.
Compared with state-of-the-art baseline models, CMAnomaly achieves an average F1 score of 0.9494, outperforming baselines by 6.77% to 10.68%, and runs 10X to 20X faster.
arXiv Detail & Related papers (2023-08-19T08:08:05Z)
- Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection [18.796225184893874]
We introduce CAD, a Conflict-aware Anomaly Detection algorithm.
We find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation.
We show that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods.
arXiv Detail & Related papers (2023-08-17T11:00:01Z)
- Generic and Robust Root Cause Localization for Multi-Dimensional Data in Online Service Systems [22.308016571592105]
Localizing root causes for multi-dimensional data is critical to ensure online service systems' reliability.
This paper proposes a generic and robust root cause localization approach for multi-dimensional data, PSqueeze.
Case studies in several production systems demonstrate that PSqueeze is helpful to fault diagnosis in the real world.
arXiv Detail & Related papers (2023-05-05T07:22:30Z)
- Causality-Based Multivariate Time Series Anomaly Detection [63.799474860969156]
We formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism to generate the multivariate data.
We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism.
We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications.
arXiv Detail & Related papers (2022-06-30T06:00:13Z)
- CSCAD: Correlation Structure-based Collective Anomaly Detection in Complex System [11.739889613196619]
We propose a correlation structure-based collective anomaly detection model for high-dimensional anomaly detection problem in large systems.
Our framework utilizes a graph convolutional network combined with a variational autoencoder to jointly exploit the feature space correlation and reconstruction deficiency of samples.
An anomaly discriminating network can then be trained using low anomalous degree samples as positive samples, and high anomalous degree samples as negative samples.
arXiv Detail & Related papers (2021-05-30T09:28:25Z)
- TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied on IT system operation and maintenance.
One direction aims at the recognition of re-occurring anomaly types to enable remediation automation.
We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z)
- TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.