Performance Issue Identification in Cloud Systems with
Relational-Temporal Anomaly Detection
- URL: http://arxiv.org/abs/2307.10869v2
- Date: Tue, 1 Aug 2023 07:04:29 GMT
- Title: Performance Issue Identification in Cloud Systems with
Relational-Temporal Anomaly Detection
- Authors: Wenwei Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Yuxin Su,
Jiazhen Gu, Cong Feng, Zengyin Yang and Michael Lyu
- Abstract summary: Performance issues permeate large-scale cloud service systems, which can lead to huge revenue losses.
To ensure reliable performance, it's essential to accurately identify these issues using service monitoring metrics.
Some existing methods tackle this problem by analyzing each metric independently to detect anomalies.
- Score: 5.473091770227683
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Performance issues permeate large-scale cloud service systems, which can lead
to huge revenue losses. To ensure reliable performance, it's essential to
accurately identify and localize these issues using service monitoring metrics.
Given the complexity and scale of modern cloud systems, this task can be
challenging and may require extensive expertise and resources beyond the
capacity of individual humans. Some existing methods tackle this problem by
analyzing each metric independently to detect anomalies. However, this could
incur overwhelming alert storms that are difficult for engineers to diagnose
manually. To pursue better performance, not only the temporal patterns of
metrics but also the correlation between metrics (i.e., relational patterns)
should be considered, which can be formulated as a multivariate metrics anomaly
detection problem. However, most of the studies fall short of extracting these
two types of features explicitly. Moreover, there exist some unlabeled
anomalies mixed in the training data, which may hinder the detection
performance. To address these limitations, we propose the Relational-Temporal
Anomaly Detection Model (RTAnomaly) that combines the relational and temporal
information of metrics. RTAnomaly employs a graph attention layer to learn the
dependencies among metrics, which will further help pinpoint the anomalous
metrics that may cause the anomaly effectively. In addition, we exploit the
concept of positive unlabeled learning to address the issue of potential
anomalies in the training data. To evaluate our method, we conduct experiments
on a public dataset and two industrial datasets. RTAnomaly outperforms all the
baseline models by achieving an average F1 score of 0.929 and Hit@3 of 0.920,
demonstrating its superiority.
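The Hit@3 figure reported above is a localization metric. As a rough illustration only, the sketch below computes a common form of Hit@k: the fraction of anomaly cases in which the truly anomalous metric is ranked among the k highest-scoring metrics. The function name `hit_at_k`, its inputs, and this exact definition are assumptions made for illustration; the abstract does not spell out how RTAnomaly's evaluation computes it.

```python
from typing import Sequence


def hit_at_k(
    scores_per_case: Sequence[Sequence[float]],
    true_metric_per_case: Sequence[int],
    k: int = 3,
) -> float:
    """Fraction of cases whose true anomalous metric is in the top-k by score.

    scores_per_case[i][j] is a (hypothetical) anomaly score for metric j in case i;
    true_metric_per_case[i] is the index of the metric labeled as the culprit.
    """
    if len(scores_per_case) != len(true_metric_per_case):
        raise ValueError("each case needs exactly one ground-truth metric index")
    if not true_metric_per_case:
        raise ValueError("at least one case is required")

    hits = 0
    for scores, truth in zip(scores_per_case, true_metric_per_case):
        # Rank metric indices from highest to lowest anomaly score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_metric_per_case)


# Two toy cases: the culprit is ranked first in one and last in the other.
example_scores = [[0.1, 0.9, 0.3, 0.2], [0.5, 0.4, 0.3, 0.2, 0.1]]
example_truth = [1, 4]
example_hit3 = hit_at_k(example_scores, example_truth, k=3)
```

Under this definition, a Hit@3 of 0.920 would mean that in about 92% of cases the culprit metric appears among the three most suspicious candidates shown to an engineer.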
Related papers
- Learning Unified System Representations for Microservice Tail Latency Prediction [8.532290784939967]
Microservice architectures have become the de facto standard for building scalable cloud-native applications. Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise. We propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features.
arXiv Detail & Related papers (2025-08-03T07:46:23Z)
- GAL-MAD: Towards Explainable Anomaly Detection in Microservice Applications Using Graph Attention Networks [1.0136215038345013]
Anomalies stemming from network and performance issues must be swiftly identified and addressed.
Existing anomaly detection techniques often rely on statistical models or machine learning methods.
We propose a novel anomaly detection model called Graph Attention and LSTM-based Microservice Anomaly Detection (GAL-MAD).
arXiv Detail & Related papers (2025-03-31T10:11:31Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Multitask Active Learning for Graph Anomaly Detection [48.690169078479116]
We propose a novel MultItask acTIve Graph Anomaly deTEction framework, namely MITIGATE.
By coupling node classification tasks, MITIGATE obtains the capability to detect out-of-distribution nodes without known anomalies.
Empirical studies on four datasets demonstrate that MITIGATE significantly outperforms the state-of-the-art methods for anomaly detection.
arXiv Detail & Related papers (2024-01-24T03:43:45Z)
- GATGPT: A Pre-trained Large Language Model with Graph Attention Network
for Spatiotemporal Imputation [19.371155159744934]
In real-world settings, such data often contain missing elements due to issues like sensor malfunctions and data transmission errors.
The objective of spatiotemporal imputation is to estimate these missing values by understanding the inherent spatial and temporal relationships in the observed time series.
Traditionally, spatiotemporal imputation has relied on specific architectures, which suffer from limited applicability and high computational complexity.
In contrast, our approach integrates pre-trained large language models (LLMs) into spatiotemporal imputation, introducing a groundbreaking framework, GATGPT.
arXiv Detail & Related papers (2023-11-24T08:15:11Z)
- Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning
for Microservice System [24.2074235652359]
We propose MSTGAD, which seamlessly integrates all available data modalities via attentive multi-modal learning.
We construct a transformer-based neural network with both spatial and temporal attention mechanisms to model the inter-correlations between different modalities.
This enables us to detect anomalies automatically and accurately in real-time.
arXiv Detail & Related papers (2023-10-07T06:28:41Z)
- Self-supervised Learning for Anomaly Detection in Computational
Workflows [10.39119516144685]
We introduce an autoencoder-driven self-supervised learning (SSL) approach that learns a summary statistic from unlabeled workflow data.
In this approach, we combine generative and contrastive learning objectives to detect outliers in the summary statistics.
We demonstrate that by estimating the distribution of normal behavior in the latent space, we can outperform state-of-the-art anomaly detection methods on our benchmark datasets.
arXiv Detail & Related papers (2023-10-02T14:31:56Z)
- Practical Anomaly Detection over Multivariate Monitoring Metrics for
Online Services [29.37493773435177]
CMAnomaly is an anomaly detection framework on multivariate monitoring metrics based on collaborative machine.
The proposed framework is extensively evaluated with both public data and industrial data collected from a large-scale online service system of Huawei Cloud.
Compared with state-of-the-art baseline models, CMAnomaly achieves an average F1 score of 0.9494, outperforming baselines by 6.77% to 10.68%, and runs 10X to 20X faster.
arXiv Detail & Related papers (2023-08-19T08:08:05Z)
- Correlation-aware Spatial-Temporal Graph Learning for Multivariate
Time-series Anomaly Detection [67.60791405198063]
We propose a correlation-aware spatial-temporal graph learning (termed CST-GL) for time series anomaly detection.
CST-GL explicitly captures the pairwise correlations via a multivariate time series correlation learning module.
A novel anomaly scoring component is further integrated into CST-GL to estimate the degree of an anomaly in a purely unsupervised manner.
arXiv Detail & Related papers (2023-07-17T11:04:27Z)
- Learning Prompt-Enhanced Context Features for Weakly-Supervised Video
Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z)
- TELESTO: A Graph Neural Network Model for Anomaly Classification in
Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied on IT system operation and maintenance.
One direction aims at the recognition of re-occurring anomaly types to enable remediation automation.
We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z)
- Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments.
We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data.
Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
arXiv Detail & Related papers (2020-02-20T15:00:54Z)