BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection
- URL: http://arxiv.org/abs/2405.09330v1
- Date: Wed, 15 May 2024 13:32:59 GMT
- Title: BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection
- Authors: Luan Pham, Huong Ha, Hongyu Zhang
- Abstract summary: We propose an end-to-end approach that integrates anomaly detection and root cause analysis.
BARO consistently outperforms state-of-the-art approaches in both anomaly detection and root cause analysis.
- Score: 11.627235799040388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting failures and identifying their root causes promptly and accurately is crucial for ensuring the availability of microservice systems. A typical failure troubleshooting pipeline for microservices consists of two phases: anomaly detection and root cause analysis. While various existing works on root cause analysis require accurate anomaly detection, there is no guarantee of accurate estimation with anomaly detection techniques. Inaccurate anomaly detection results can significantly affect the root cause localization results. To address this challenge, we propose BARO, an end-to-end approach that integrates anomaly detection and root cause analysis for effectively troubleshooting failures in microservice systems. BARO leverages the Multivariate Bayesian Online Change Point Detection technique to model the dependency within multivariate time-series metrics data, enabling it to detect anomalies more accurately. BARO also incorporates a novel nonparametric statistical hypothesis testing technique for robustly identifying root causes, which is less sensitive to the accuracy of anomaly detection compared to existing works. Our comprehensive experiments conducted on three popular benchmark microservice systems demonstrate that BARO consistently outperforms state-of-the-art approaches in both anomaly detection and root cause analysis.
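The abstract does not spell out the test statistic behind the root cause ranking, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a change point index has already been produced by some detector, and it scores each metric by how far its post-change values drift from a median/IQR baseline computed on the pre-change data. The function name `robust_scores`, the metric names, and the median/IQR choice are all illustrative assumptions.

```python
from statistics import median, quantiles


def robust_scores(metrics, change_point):
    """Rank metrics by post-change deviation from a pre-change baseline.

    metrics: dict mapping a metric name to a list of floats (one series each).
    change_point: index splitting each series into a normal part and a
        suspected-anomalous part (assumed to come from an upstream detector).
    """
    scores = {}
    for name, series in metrics.items():
        normal, suspect = series[:change_point], series[change_point:]
        if len(normal) < 2 or not suspect:
            continue  # not enough data on one side of the split
        med = median(normal)
        q1, _, q3 = quantiles(normal, n=4)
        iqr = (q3 - q1) or 1.0  # fall back to 1.0 for a flat baseline
        # Assumed scoring rule: largest IQR-scaled deviation after the change.
        scores[name] = max(abs(x - med) / iqr for x in suspect)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical metrics: one shifts sharply after index 6, one stays flat.
    data = {
        "cart_cpu": [10, 11, 10, 12, 11, 10, 50, 55, 60],
        "checkout_latency": [100, 102, 99, 101, 100, 98, 103, 101, 100],
    }
    for name, score in robust_scores(data, change_point=6):
        print(name, round(score, 2))
```

Because the baseline uses the median and interquartile range rather than the mean and standard deviation, a change point placed a few samples too early or too late shifts the baseline only slightly, which is one plausible way to reduce sensitivity to inaccurate anomaly detection.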
Related papers
- Unsupervised Anomaly Detection Using Diffusion Trend Analysis [48.19821513256158]
We propose a method to detect anomalies by analysis of reconstruction trend depending on the degree of degradation.
The proposed method is validated on an open dataset for industrial anomaly detection.
arXiv Detail & Related papers (2024-07-12T01:50:07Z)
- Graph Spatiotemporal Process for Multivariate Time Series Anomaly Detection with Missing Values [67.76168547245237]
We introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and an anomaly scorer to detect anomalies.
Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-01-11T10:10:16Z)
- PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning [58.85063149619348]
We propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows.
Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets.
arXiv Detail & Related papers (2023-01-25T16:34:43Z)
- Causality-Based Multivariate Time Series Anomaly Detection [63.799474860969156]
We formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism to generate the multivariate data.
We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism.
We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications.
arXiv Detail & Related papers (2022-06-30T06:00:13Z)
- An Evaluation of Anomaly Detection and Diagnosis in Multivariate Time Series [7.675917669905486]
This paper presents a systematic and comprehensive evaluation of unsupervised and semi-supervised deep-learning based methods for anomaly detection and diagnosis.
We vary the model and post-processing of model errors, through a grid of 10 models and 4 scoring functions, comparing these variants to state of the art methods.
We find that the existing evaluation metrics either do not take events into account, or cannot distinguish between a good detector and trivial detectors.
arXiv Detail & Related papers (2021-09-23T15:14:24Z)
- A Survey on Anomaly Detection for Technical Systems using LSTM Networks [0.0]
Anomalies represent deviations from the intended system operation and can lead to decreased efficiency as well as partial or complete system failure.
In this article, a survey on state-of-the-art anomaly detection using deep neural and especially long short-term memory networks is conducted.
The investigated approaches are evaluated based on the application scenario, data and anomaly types as well as further metrics.
arXiv Detail & Related papers (2021-05-28T13:24:40Z)
- An Explainable Artificial Intelligence Approach for Unsupervised Fault Detection and Diagnosis in Rotating Machinery [2.055054374525828]
This paper proposes a new approach for fault detection and diagnosis in rotating machinery.
The methodology consists of three parts: feature extraction, fault detection and fault diagnosis.
The effectiveness of the proposed approach is shown on three datasets containing different mechanical faults.
arXiv Detail & Related papers (2021-02-23T18:28:18Z)
- TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
- NADS: Neural Architecture Distribution Search for Uncertainty Awareness [79.18710225716791]
Machine learning (ML) systems often encounter Out-of-Distribution (OoD) errors when dealing with testing data coming from a distribution different from training data.
Existing OoD detection approaches are prone to errors and even sometimes assign higher likelihoods to OoD samples.
We propose Neural Architecture Distribution Search (NADS) to identify common building blocks among all uncertainty-aware architectures.
arXiv Detail & Related papers (2020-06-11T17:39:07Z)
- Root Cause Detection Among Anomalous Time Series Using Temporal State Alignment [0.0]
We propose a method that isolates the root cause of an anomaly by analyzing the patterns in time series fluctuations.
The idea is to track the propagation of the effect when a problem causes unaligned but homogeneous shifts of the underlying states.
We evaluate our approach by finding the root cause of anomalies in Zillow's clickstream data by identifying causal patterns among a set of observed fluctuations.
arXiv Detail & Related papers (2020-01-04T08:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.