Cluster-Wide Task Slowdown Detection in Cloud System
- URL: http://arxiv.org/abs/2408.04236v1
- Date: Thu, 8 Aug 2024 05:43:20 GMT
- Title: Cluster-Wide Task Slowdown Detection in Cloud System
- Authors: Feiyi Chen, Yingying Zhang, Lunting Fan, Yuxuan Liang, Guansong Pang, Qingsong Wen, Shuiguang Deng
- Abstract summary: Slow task detection is a critical problem in cloud operation and maintenance.
Most anomaly detection methods detect it from a single-task aspect.
We propose SORN, which consists of a Skimming Attention mechanism to reconstruct the compound periodicity.
- Score: 45.396508032554564
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Slow task detection is a critical problem in cloud operation and maintenance since it is highly related to user experience and can bring substantial liquidated damages. Most anomaly detection methods detect it from a single-task aspect. However, considering millions of concurrent tasks in large-scale cloud computing clusters, it becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a malfunction of a cluster due to its violent fluctuation nature in a virtual environment. Thus, we shift our attention to cluster-wide task slowdowns by utilizing the duration time distribution of tasks across a cluster, so that the computation complexity is not relevant to the number of tasks. The task duration time distribution often exhibits compound periodicity and local exceptional fluctuations over time. Though transformer-based methods are one of the most powerful methods to capture these time series normal variation patterns, we empirically find and theoretically explain the flaw of the standard attention mechanism in reconstructing subperiods with low amplitude when dealing with compound periodicity. To tackle these challenges, we propose SORN (i.e., Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism to reconstruct the compound periodicity and a Neural Optimal Transport module to distinguish cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in a practical scenario, we propose a picky loss function, which adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets.
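The abstract's central idea, summarizing a cluster by the distribution of task durations rather than by individual tasks, can be illustrated with a minimal sketch. This is not the authors' implementation: the bin edges, the function name `duration_distribution`, and the sample durations are all hypothetical, chosen only to show that the per-time-slot feature vector has a fixed size regardless of how many tasks ran.

```python
from bisect import bisect_right

# Hypothetical bin edges in seconds (an assumption; the abstract gives none).
# The last bin is open-ended and catches every duration >= BIN_EDGES[-1].
BIN_EDGES = [0, 1, 5, 10, 30, 60, 300]


def duration_distribution(durations, edges=BIN_EDGES):
    """Return a normalized histogram of task durations for one time slot.

    The output length equals len(edges), so downstream models see a
    fixed-size input no matter how many tasks the slot contains.
    """
    counts = [0] * len(edges)
    for d in durations:
        # Index of the bin whose left edge is the largest edge <= d.
        idx = max(bisect_right(edges, d) - 1, 0)
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(edges)
    return [c / total for c in counts]


# Two slots with very different task counts but the same duration mix.
small_slot = duration_distribution([0.5, 2, 2, 400])
large_slot = duration_distribution([0.5] * 1000 + [2] * 2000 + [400] * 1000)
```

Under these assumptions, `small_slot` and `large_slot` are identical vectors of length 7, which is the sense in which the cost of the later detection step does not depend on the number of tasks. A sequence of such vectors over time is the kind of series in which periodicity and exceptional fluctuations would then be modeled.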
Related papers
- Pattern-Based Time-Series Risk Scoring for Anomaly Detection and Alert Filtering -- A Predictive Maintenance Case Study [3.508168174653255]
We propose a fast and efficient approach to anomaly detection and alert filtering based on sequential pattern similarities.
We show how this approach can be leveraged for a variety of purposes involving anomaly detection on a large scale real-world industrial system.
arXiv Detail & Related papers (2024-05-24T20:27:45Z) - Concrete Dense Network for Long-Sequence Time Series Clustering [4.307648859471193]
Time series clustering is fundamental in data analysis for discovering temporal patterns.
Deep temporal clustering methods have been trying to integrate the canonical k-means into end-to-end training of neural networks.
LoSTer is a novel dense autoencoder architecture for the long-sequence time series clustering problem.
arXiv Detail & Related papers (2024-05-08T12:31:35Z) - Graph Spatiotemporal Process for Multivariate Time Series Anomaly
Detection with Missing Values [67.76168547245237]
We introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and anomaly scorer to detect anomalies.
Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-01-11T10:10:16Z) - Spatio-temporal predictive tasks for abnormal event detection in videos [60.02503434201552]
We propose new constrained pretext tasks to learn object level normality patterns.
Our approach consists in learning a mapping between down-scaled visual queries and their corresponding normal appearance and motion characteristics.
Experiments on several benchmark datasets demonstrate the effectiveness of our approach to localize and track anomalies.
arXiv Detail & Related papers (2022-10-27T19:45:12Z) - Anomaly Transformer: Time Series Anomaly Detection with Association
Discrepancy [68.86835407617778]
Anomaly Transformer achieves state-of-the-art performance on six unsupervised time series anomaly detection benchmarks.
arXiv Detail & Related papers (2021-10-06T10:33:55Z) - Federated Variational Learning for Anomaly Detection in Multivariate
Time Series [13.328883578980237]
We propose an unsupervised time series anomaly detection framework in a federated fashion.
We leave the training data distributed at the edge to learn a shared Variational Autoencoder (VAE) based on Convolutional Gated Recurrent Unit (ConvGRU) model.
Experiments on three real-world networked sensor datasets illustrate the advantage of our approach over other state-of-the-art models.
arXiv Detail & Related papers (2021-08-18T22:23:15Z) - Consistency of mechanistic causal discovery in continuous-time using
Neural ODEs [85.7910042199734]
We consider causal discovery in continuous-time for the study of dynamical systems.
We propose a causal discovery algorithm based on penalized Neural ODEs.
arXiv Detail & Related papers (2021-05-06T08:48:02Z) - Low-Rank Autoregressive Tensor Completion for Spatiotemporal Traffic
Data Imputation [4.9831085918734805]
Missing data imputation has been a long-standing research topic and critical application for real-world intelligent transportation systems.
We propose a low-rank autoregressive tensor completion (LATC) framework by introducing temporal variation as a new regularization term.
We conduct extensive numerical experiments on several real-world traffic data sets, and our results demonstrate the effectiveness of LATC in diverse missing scenarios.
arXiv Detail & Related papers (2021-04-30T12:00:57Z) - Granger Causality Based Hierarchical Time Series Clustering for State
Estimation [8.384689499720515]
Clustering is useful when working with a large volume of unlabeled data.
We propose a hierarchical time series clustering technique based on symbolic dynamic filtering and Granger causality.
A new distance metric based on Granger causality is proposed and used for the time series clustering, as well as validated on empirical data sets.
arXiv Detail & Related papers (2021-04-09T06:14:54Z) - TadGAN: Time Series Anomaly Detection Using Generative Adversarial
Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.