Cluster-Wide Task Slowdown Detection in Cloud System
- URL: http://arxiv.org/abs/2408.04236v1
- Date: Thu, 8 Aug 2024 05:43:20 GMT
- Title: Cluster-Wide Task Slowdown Detection in Cloud System
- Authors: Feiyi Chen, Yingying Zhang, Lunting Fan, Yuxuan Liang, Guansong Pang, Qingsong Wen, Shuiguang Deng
- Abstract summary: Slow task detection is a critical problem in cloud operation and maintenance.
Most anomaly detection methods detect it from a single-task aspect.
We propose SORN, which consists of a Skimming Attention mechanism to reconstruct the compound periodicity.
- Score: 45.396508032554564
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Slow task detection is a critical problem in cloud operation and maintenance since it is highly related to user experience and can bring substantial liquidated damages. Most anomaly detection methods detect it from a single-task aspect. However, considering millions of concurrent tasks in large-scale cloud computing clusters, it becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a malfunction of a cluster due to its violent fluctuation nature in a virtual environment. Thus, we shift our attention to cluster-wide task slowdowns by utilizing the duration time distribution of tasks across a cluster, so that the computation complexity is not relevant to the number of tasks. The task duration time distribution often exhibits compound periodicity and local exceptional fluctuations over time. Though transformer-based methods are one of the most powerful methods to capture these time series normal variation patterns, we empirically find and theoretically explain the flaw of the standard attention mechanism in reconstructing subperiods with low amplitude when dealing with compound periodicity. To tackle these challenges, we propose SORN (i.e., Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism to reconstruct the compound periodicity and a Neural Optimal Transport module to distinguish cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in a practical scenario, we propose a picky loss function, which adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets.
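The abstract's point that the cost becomes independent of the number of tasks can be illustrated with a minimal sketch. It assumes each time slot is summarized as a normalized histogram of task durations over fixed bins; the bin edges, the function name `duration_histogram`, and the sample data are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right

def duration_histogram(durations, bin_edges):
    """Summarize one time slot as a normalized histogram of task durations.

    The output always has len(bin_edges) + 1 entries, no matter how many
    tasks ran in the slot, so a downstream model sees a fixed-size input.
    """
    counts = [0] * (len(bin_edges) + 1)
    for d in durations:
        # bisect_right returns the index of the bin that d falls into
        counts[bisect_right(bin_edges, d)] += 1
    total = len(durations)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]

# Hypothetical bin edges in seconds (an assumption for illustration only)
BIN_EDGES = [1.0, 5.0, 30.0, 120.0]

# Two slots with very different task counts yield same-length feature vectors
quiet_slot = [0.4, 0.8, 2.0, 3.5]
busy_slot = [0.5] * 500 + [3.0] * 300 + [45.0] * 200
features = [duration_histogram(s, BIN_EDGES) for s in (quiet_slot, busy_slot)]
```

Stacking these vectors over consecutive time slots gives a multivariate time series whose width depends only on the number of bins, which is the kind of input a reconstruction-based detector could consume.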
Related papers
- MAAT: Mamba Adaptive Anomaly Transformer with association discrepancy for time series [5.924110046959179]
Anomaly detection in time series is essential for industrial monitoring and environmental sensing.
Existing methods face limitations such as sensitivity to short-term contexts and inefficiency in noisy, non-stationary environments.
We introduce MAAT, an improved architecture that enhances association discrepancy modeling and reconstruction quality.
arXiv Detail & Related papers (2025-02-11T16:22:06Z)
- Multivariate Time Series Anomaly Detection by Capturing Coarse-Grained Intra- and Inter-Variate Dependencies [14.784236273395017]
We introduce MtsCID, a novel semi-supervised multivariate time series anomaly detection method.
We show that MtsCID achieves performance comparable or superior to state-of-the-art benchmark methods.
arXiv Detail & Related papers (2025-01-22T05:53:12Z)
- Pattern-Based Time-Series Risk Scoring for Anomaly Detection and Alert Filtering -- A Predictive Maintenance Case Study [3.508168174653255]
We propose a fast and efficient approach to anomaly detection and alert filtering based on sequential pattern similarities.
We show how this approach can be leveraged for a variety of purposes involving anomaly detection on a large scale real-world industrial system.
arXiv Detail & Related papers (2024-05-24T20:27:45Z)
- Concrete Dense Network for Long-Sequence Time Series Clustering [4.307648859471193]
Time series clustering is fundamental in data analysis for discovering temporal patterns.
Deep temporal clustering methods have been trying to integrate the canonical k-means into end-to-end training of neural networks.
LoSTer is a novel dense autoencoder architecture for the long-sequence time series clustering problem.
arXiv Detail & Related papers (2024-05-08T12:31:35Z)
- Graph Spatiotemporal Process for Multivariate Time Series Anomaly Detection with Missing Values [67.76168547245237]
We introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and an anomaly scorer to detect anomalies.
Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-01-11T10:10:16Z)
- Spatio-temporal predictive tasks for abnormal event detection in videos [60.02503434201552]
We propose new constrained pretext tasks to learn object level normality patterns.
Our approach consists of learning a mapping between down-scaled visual queries and their corresponding normal appearance and motion characteristics.
Experiments on several benchmark datasets demonstrate the effectiveness of our approach to localize and track anomalies.
arXiv Detail & Related papers (2022-10-27T19:45:12Z)
- Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy [68.86835407617778]
Anomaly Transformer achieves state-of-the-art performance on six unsupervised time series anomaly detection benchmarks.
arXiv Detail & Related papers (2021-10-06T10:33:55Z)
- Consistency of mechanistic causal discovery in continuous-time using Neural ODEs [85.7910042199734]
We consider causal discovery in continuous-time for the study of dynamical systems.
We propose a causal discovery algorithm based on penalized Neural ODEs.
arXiv Detail & Related papers (2021-05-06T08:48:02Z)
- Low-Rank Autoregressive Tensor Completion for Spatiotemporal Traffic Data Imputation [4.9831085918734805]
Missing data imputation has been a long-standing research topic and critical application for real-world intelligent transportation systems.
We propose a low-rank autoregressive tensor completion (LATC) framework by introducing temporal variation as a new regularization term.
We conduct extensive numerical experiments on several real-world traffic data sets, and our results demonstrate the effectiveness of LATC in diverse missing scenarios.
arXiv Detail & Related papers (2021-04-30T12:00:57Z)
- TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.