Anomaly Detection in a Large-scale Cloud Platform
- URL: http://arxiv.org/abs/2010.10966v2
- Date: Thu, 11 Feb 2021 00:55:55 GMT
- Title: Anomaly Detection in a Large-scale Cloud Platform
- Authors: Mohammad Saiful Islam, William Pourmajidi, Lei Zhang, John
Steinbacher, Tony Erwin, Andriy Miranskyy
- Abstract summary: Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud.
Service providers need to monitor the quality of their ever-growing offerings effectively.
We designed and implemented an automated monitoring system for the IBM Cloud Platform.
- Score: 9.283888139549067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cloud computing is ubiquitous: more and more companies are moving
their workloads into the Cloud. However, this rise in popularity challenges Cloud
service providers, as they need to monitor the quality of their ever-growing
offerings effectively. To address the challenge, we designed and implemented an
automated monitoring system for the IBM Cloud Platform. This monitoring system
utilizes deep learning neural networks to detect anomalies in near-real-time in
multiple Platform components simultaneously.
After running the system for a year, we observed that the proposed solution
frees the DevOps team's time and human resources from manually monitoring
thousands of Cloud components. Moreover, it increases customer satisfaction by
reducing the risk of Cloud outages.
In this paper, we share our solution's architecture, implementation notes,
and best practices that emerged while evolving the monitoring system. They can
be leveraged by other researchers and practitioners to build anomaly detectors
for complex systems.
Related papers
- Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset [1.293050392312921]
We introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console.
This dataset comprises 39,365 rows and 117,448 columns of telemetry data.
We demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process.
arXiv Detail & Related papers (2024-11-13T22:04:19Z)
- CloudHeatMap: Heatmap-Based Monitoring for Large-Scale Cloud Systems [1.1199585259018456]
This paper presents CloudHeatMap, a novel heatmap-based visualization tool for near-real-time monitoring of LCS health.
It offers intuitive visualizations of key metrics such as call volumes, response times, and HTTP response codes, enabling operators to quickly identify performance issues.
arXiv Detail & Related papers (2024-10-28T14:57:10Z)
- CloudEye: A New Paradigm of Video Analysis System for Mobile Visual Scenarios [22.871591373774802]
CloudEye is a real-time, efficient mobile visual perception system.
It uses content information mining on edge servers in a mobile vision system environment equipped with edge servers and coordinated with cloud servers.
It reduces network bandwidth usage by 69.50%, increases inference speed by 24.55%, and improves detection accuracy by 67.30%.
arXiv Detail & Related papers (2024-10-24T03:27:05Z)
- Scaling Data Science Solutions with Semantics and Machine Learning: Bosch Case [8.445414390004636]
SemCloud is a semantics-enhanced cloud system with semantic technologies and machine learning.
The system has been evaluated in an industrial use case with millions of data, thousands of repeated runs, and domain users, showing promising results.
arXiv Detail & Related papers (2023-08-02T11:58:30Z)
- Device-Cloud Collaborative Recommendation via Meta Controller [65.97416287295152]
We propose a meta controller to dynamically manage the collaboration between the on-device recommender and the cloud-based recommender.
On the basis of the counterfactual samples and the extended training, extensive experiments in industrial recommendation scenarios show the promise of the meta controller.
arXiv Detail & Related papers (2022-07-07T03:23:04Z)
- Unsupervised Point Cloud Representation Learning with Deep Neural Networks: A Survey [104.71816962689296]
Unsupervised point cloud representation learning has attracted increasing attention due to the constraint in large-scale point cloud labelling.
This paper provides a comprehensive review of unsupervised point cloud representation learning using deep neural networks.
arXiv Detail & Related papers (2022-02-28T07:46:05Z)
- Online Self-Evolving Anomaly Detection in Cloud Computing Environments [6.480575492140354]
We present a self-evolving anomaly detection (SEAD) framework for cloud dependability assurance.
Our framework self-evolves by exploring newly verified anomaly records and continuously updating the anomaly detector online.
Our detectors can achieve 88.94% in sensitivity and 94.60% on average, which makes them suitable for real-world deployment.
arXiv Detail & Related papers (2021-11-16T05:13:38Z)
- Edge-Cloud Polarization and Collaboration: A Comprehensive Survey [61.05059817550049]
We conduct a systematic review for both cloud and edge AI.
We are the first to set up the collaborative learning mechanism for cloud and edge modeling.
We discuss potentials and practical experiences of some on-going advanced edge AI topics.
arXiv Detail & Related papers (2021-11-11T05:58:23Z)
- Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z)
- Device-Cloud Collaborative Learning for Recommendation [50.01289274123047]
We propose a novel MetaPatch learning approach on the device side to efficiently achieve "thousands of people with thousands of models" given a centralized cloud model.
With billions of updated personalized device models, we propose a "model-over-models" distillation algorithm, namely MoMoDistill, to update the centralized cloud model.
arXiv Detail & Related papers (2021-04-14T05:06:59Z)
- Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.