Related papers: CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems

URL: http://arxiv.org/abs/2406.19711v1
Date: Fri, 28 Jun 2024 07:46:51 GMT
Title: CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems
Authors: Ziming Zhao, Tiehua Zhang, Zhishu Shen, Hai Dong, Xingjun Ma, Xianhui Liu, Yun Yang
Abstract summary: We propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data. CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization.
Score: 22.00860661894853
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies in enterprise-level microservice systems, it is challenging to locate anomalies promptly during service invocations, thus causing intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare it with state-of-the-art methods. The results show that CHASE achieves average performance gains of up to 36.2% (A@1) and 29.4% (Percentage@1), respectively, over its best counterpart.

Related papers

GAL-MAD: Towards Explainable Anomaly Detection in Microservice Applications Using Graph Attention Networks [1.0136215038345013]
Anomalies stemming from network and performance issues must be swiftly identified and addressed. Existing anomaly detection techniques often rely on statistical models or machine learning methods. We propose a novel anomaly detection model called Graph Attention and LSTM-based Microservice Anomaly Detection (GAL-MAD).
arXiv Detail & Related papers (2025-03-31T10:11:31Z)
Network Centrality as a New Perspective on Microservice Architecture [48.55946052680251]
The adoption of Microservice Architecture has led to the identification of various patterns and anti-patterns, such as Nano/Mega/Hub services. This study investigates whether centrality metrics (CMs) can provide new insights into MSA quality and facilitate the detection of architectural anti-patterns.
arXiv Detail & Related papers (2025-01-23T10:13:57Z)
Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
Multitask Active Learning for Graph Anomaly Detection [48.690169078479116]
We propose a novel MultItask acTIve Graph Anomaly deTEction framework, namely MITIGATE. By coupling node classification tasks, MITIGATE obtains the capability to detect out-of-distribution nodes without known anomalies. Empirical studies on four datasets demonstrate that MITIGATE significantly outperforms the state-of-the-art methods for anomaly detection.
arXiv Detail & Related papers (2024-01-24T03:43:45Z)
A Microservices Identification Method Based on Spectral Clustering for Industrial Legacy Systems [5.255685751491305]
We propose an automated microservice decomposition method for extracting microservice candidates based on spectral graph theory. We show that our method can yield favorable results even without the involvement of domain experts.
arXiv Detail & Related papers (2023-12-20T07:47:01Z)
Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System [24.2074235652359]
We propose MSTGAD, which seamlessly integrates all available data modalities via attentive multi-modal learning. We construct a transformer-based neural network with both spatial and temporal attention mechanisms to model the inter-correlations between different modalities. This enables us to detect anomalies automatically and accurately in real-time.
arXiv Detail & Related papers (2023-10-07T06:28:41Z)
GLAD: Content-aware Dynamic Graphs For Log Anomaly Detection [49.9884374409624]
We introduce GLAD, a Graph-based Log Anomaly Detection framework designed to detect anomalies in system logs.
arXiv Detail & Related papers (2023-09-12T04:21:30Z)
Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services [29.37493773435177]
CMAnomaly is an anomaly detection framework on multivariate monitoring metrics based on collaborative machine. The proposed framework is extensively evaluated with both public data and industrial data collected from a large-scale online service system of Huawei Cloud. Compared with state-of-the-art baseline models, CMAnomaly achieves an average F1 score of 0.9494, outperforming baselines by 6.77% to 10.68%, and runs 10X to 20X faster.
arXiv Detail & Related papers (2023-08-19T08:08:05Z)
Robust Multimodal Failure Detection for Microservice Systems [32.25907616511765]
AnoFusion is an unsupervised failure detection approach for microservice systems. It learns the correlation of the heterogeneous multimodal data and integrates a Graph Attention Network (GAT) and Gated Recurrent Unit (GRU). It achieves F1-scores of 0.857 and 0.922, respectively, outperforming state-of-the-art failure detection approaches.
arXiv Detail & Related papers (2023-05-30T12:39:42Z)
Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention [29.654681594903114]
We propose Hades, the first end-to-end semi-supervised approach to identify system anomalies based on heterogeneous data. Our approach employs a hierarchical architecture to learn a global representation of the system status by fusing log semantics and metric patterns. We evaluate Hades extensively on large-scale simulated data and datasets from Huawei Cloud.
arXiv Detail & Related papers (2023-02-14T09:02:11Z)
BCDAG: An R package for Bayesian structure and Causal learning of Gaussian DAGs [77.34726150561087]
We introduce BCDAG, an R package for causal discovery and causal effect estimation from observational data. Our implementation scales efficiently with the number of observations and, whenever the DAGs are sufficiently sparse, the number of variables in the dataset. We then illustrate the main functions and algorithms on both real and simulated datasets.
arXiv Detail & Related papers (2022-01-28T09:30:32Z)
A2Log: Attentive Augmented Log Anomaly Detection [53.06341151551106]
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services. Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary. We develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: Anomaly scoring and anomaly decision.
arXiv Detail & Related papers (2021-09-20T13:40:21Z)
Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies [58.88325379746632]
We present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies as edges to improve the identification and localization of anomalies. Given a series of metrics, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected. The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus, which incorporates information about system component dependencies.
arXiv Detail & Related papers (2021-03-09T06:34:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.