Graph-based Incident Aggregation for Large-Scale Online Service Systems
- URL: http://arxiv.org/abs/2108.12179v1
- Date: Fri, 27 Aug 2021 08:48:55 GMT
- Title: Graph-based Incident Aggregation for Large-Scale Online Service Systems
- Authors: Zhuangbin Chen, Jinyang Liu, Yuxin Su, Hongyu Zhang, Xuemin Wen, Xiao Ling, Yongqiang Yang, Michael R. Lyu
- Abstract summary: We propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures.
A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents.
The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud.
- Score: 33.70557954446136
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As online service systems continue to grow in terms of complexity and volume,
how service incidents are managed will significantly impact company revenue and
user trust. Due to the cascading effect, cloud failures often come with an
overwhelming number of incidents from dependent services and devices. To pursue
efficient incident management, related incidents should be quickly aggregated
to narrow down the problem scope. To this end, in this paper, we propose GRLIA,
an incident aggregation framework based on graph representation learning over
the cascading graph of cloud failures. A representation vector is learned for
each unique type of incident in an unsupervised and unified manner, which is
able to simultaneously encode the topological and temporal correlations among
incidents. Thus, it can be easily employed for online incident aggregation. In
particular, to learn the correlations more accurately, we try to recover the
complete scope of failures' cascading impact by leveraging fine-grained system
monitoring data, i.e., Key Performance Indicators (KPIs). The proposed
framework is evaluated with real-world incident data collected from a
large-scale online service system of Huawei Cloud. The experimental results
demonstrate that GRLIA is effective and outperforms existing methods.
Furthermore, our framework has been successfully deployed in industrial
practice.
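As a rough illustration of how per-type representation vectors could be used for online aggregation, the sketch below groups incoming incidents greedily by cosine similarity within a time window. This is not the GRLIA implementation: the function names, the similarity threshold, the time window, the incident type names, and the toy vectors are all assumptions made for the example, and the vectors here are hand-written rather than learned from a cascading graph.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def aggregate(incidents, embeddings, threshold=0.8, window=600):
    """Greedily group time-ordered incidents (hypothetical helper).

    incidents:  list of (timestamp_seconds, incident_type), sorted by time
    embeddings: dict mapping incident_type -> representation vector
    threshold:  assumed minimum cosine similarity to join a group
    window:     assumed maximum gap in seconds to a group's latest incident
    """
    groups = []
    for ts, itype in incidents:
        vec = embeddings[itype]
        placed = False
        for group in groups:
            last_ts, _ = group[-1]
            if ts - last_ts > window:
                continue  # group is too old to absorb this incident
            if any(cosine(vec, embeddings[t]) >= threshold for _, t in group):
                group.append((ts, itype))
                placed = True
                break
        if not placed:
            groups.append([(ts, itype)])  # start a new group
    return groups


# Toy vectors standing in for learned representations (assumed values).
embeddings = {
    "disk_io_high": [0.9, 0.1, 0.0],
    "vm_unreachable": [0.8, 0.2, 0.1],
    "cert_expiring": [0.0, 0.1, 0.9],
}
incidents = [
    (0, "disk_io_high"),
    (30, "vm_unreachable"),
    (45, "cert_expiring"),
    (5000, "disk_io_high"),
]
groups = aggregate(incidents, embeddings)
for group in groups:
    print(group)
```

In this toy run the first two incidents share a group because their vectors are close and they occur within the window, while the unrelated incident type and the much later incident each start their own group.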
Related papers
- CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems [22.00860661894853]
We propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data.
CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization.
arXiv Detail & Related papers (2024-06-28T07:46:51Z)
- KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks [14.336830860792707]
KGroot uses event knowledge and the correlation between events to perform root cause reasoning.
Experiments demonstrate that KGroot can locate the root cause among the top 3 potential causes with 93.5% accuracy, at second-level latency.
arXiv Detail & Related papers (2024-02-11T10:30:38Z)
- Dependency Aware Incident Linking in Large Cloud Systems [8.797638977934646]
We propose dependency-aware incident linking (DiLink) framework to improve the accuracy and coverage of incident links.
We also propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes.
arXiv Detail & Related papers (2024-02-05T13:54:11Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Causality is all you need [63.10680366545293]
Causal Graph Routing (CGR) is an integrated causal scheme relying entirely on the intervention mechanisms to reveal the cause-effect forces hidden in data.
CGR can surpass the current state-of-the-art methods on both Visual Question Answer and Long Document Classification tasks.
arXiv Detail & Related papers (2023-11-21T02:53:40Z)
- Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services [29.37493773435177]
CMAnomaly is an anomaly detection framework on multivariate monitoring metrics based on collaborative machine.
The proposed framework is extensively evaluated with both public data and industrial data collected from a large-scale online service system of Huawei Cloud.
Compared with state-of-the-art baseline models, CMAnomaly achieves an average F1 score of 0.9494, outperforming baselines by 6.77% to 10.68%, and runs 10X to 20X faster.
arXiv Detail & Related papers (2023-08-19T08:08:05Z)
- Identifying Performance Issues in Cloud Service Systems Based on Relational-Temporal Features [11.83269525626691]
Cloud systems are susceptible to performance issues, which may cause service-level agreement violations and financial losses.
We propose a learning-based approach that leverages both the relational and temporal features of metrics to identify performance issues.
arXiv Detail & Related papers (2023-07-20T13:41:26Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training a RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Relational Graph Neural Networks for Fraud Detection in a Super-App environment [53.561797148529664]
We propose a framework of relational graph convolutional networks methods for fraudulent behaviour prevention in the financial services of a Super-App.
We use an interpretability algorithm for graph neural networks to determine the most important relations to the classification task of the users.
Our results show that there is an added value when considering models that take advantage of the alternative data of the Super-App and the interactions found in their high connectivity.
arXiv Detail & Related papers (2021-07-29T00:02:06Z)
- Information Obfuscation of Graph Neural Networks [96.8421624921384]
We study the problem of protecting sensitive attributes by information obfuscation when learning with graph structured data.
We propose a framework to locally filter out pre-determined sensitive attributes via adversarial training with the total variation and the Wasserstein distance.
arXiv Detail & Related papers (2020-09-28T17:55:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.