Anomaly Detection in a Large-scale Cloud Platform
- URL: http://arxiv.org/abs/2010.10966v2
- Date: Thu, 11 Feb 2021 00:55:55 GMT
- Title: Anomaly Detection in a Large-scale Cloud Platform
- Authors: Mohammad Saiful Islam, William Pourmajidi, Lei Zhang, John
Steinbacher, Tony Erwin, Andriy Miranskyy
- Abstract summary: Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud.
Service providers need to monitor the quality of their ever-growing offerings effectively.
We designed and implemented an automated monitoring system for the IBM Cloud Platform.
- Score: 9.283888139549067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cloud computing is ubiquitous: more and more companies are moving their
workloads into the Cloud. However, this rise in popularity challenges Cloud
service providers, as they need to monitor the quality of their ever-growing
offerings effectively. To address the challenge, we designed and implemented an
automated monitoring system for the IBM Cloud Platform. This monitoring system
utilizes deep learning neural networks to detect anomalies in near-real-time in
multiple Platform components simultaneously.
After running the system for a year, we observed that the proposed solution
frees the DevOps team's time and human resources from manually monitoring
thousands of Cloud components. Moreover, it increases customer satisfaction by
reducing the risk of Cloud outages.
In this paper, we share our solution's architecture, implementation notes,
and best practices that emerged while evolving the monitoring system. They can
be leveraged by other researchers and practitioners to build anomaly detectors
for complex systems.
Related papers
- Monitoring Auditable Claims in the Cloud [0.0]
We propose a flexible monitoring approach that is independent of the implementation of the observed system.
Our approach is based on combining distributed Datalog-based programs with tamper-proof storage based on Trillian.
We apply our approach to an industrial use case that uses a cloud infrastructure for orchestrating unmanned air vehicles.
arXiv Detail & Related papers (2023-12-19T11:21:18Z)
- Scaling Data Science Solutions with Semantics and Machine Learning: Bosch Case [8.445414390004636]
SemCloud is a semantics-enhanced cloud system with semantic technologies and machine learning.
The system has been evaluated in an industrial use case with millions of data, thousands of repeated runs, and domain users, showing promising results.
arXiv Detail & Related papers (2023-08-02T11:58:30Z)
- Managing Cold-start in The Serverless Cloud with Temporal Convolutional Networks [0.0]
Serverless cloud is an innovative cloud service model that frees customers from most cloud management duties.
A big threat to the serverless cloud's performance is cold-start, which is when the time of provisioning the needed cloud resource to serve customers' requests incurs unacceptable costs to the service providers and/or the customers.
This paper proposes a novel low-coupling, high-cohesion ensemble policy that addresses the cold-start problem at infrastructure- and function-levels of the serverless cloud stack.
arXiv Detail & Related papers (2023-04-01T21:54:22Z)
- IDEAL: Toward High-efficiency Device-Cloud Collaborative and Dynamic Recommendation System [48.04687384069841]
Two trends enable the device-cloud collaborative and dynamic recommendation.
We design a new device intelligence task to implement IDEAL by detecting out-of-domain data.
Our study demonstrates IDEAL's effectiveness and generalizability on four public benchmarks.
arXiv Detail & Related papers (2023-02-14T20:44:12Z)
- Device-Cloud Collaborative Recommendation via Meta Controller [65.97416287295152]
We propose a meta controller to dynamically manage the collaboration between the on-device recommender and the cloud-based recommender.
On the basis of the counterfactual samples and the extended training, extensive experiments in industrial recommendation scenarios show the promise of the meta controller.
arXiv Detail & Related papers (2022-07-07T03:23:04Z)
- Unsupervised Point Cloud Representation Learning with Deep Neural Networks: A Survey [104.71816962689296]
Unsupervised point cloud representation learning has attracted increasing attention due to the constraint in large-scale point cloud labelling.
This paper provides a comprehensive review of unsupervised point cloud representation learning using deep neural networks.
arXiv Detail & Related papers (2022-02-28T07:46:05Z)
- Online Self-Evolving Anomaly Detection in Cloud Computing Environments [6.480575492140354]
We present a self-evolving anomaly detection (SEAD) framework for cloud dependability assurance.
Our framework self-evolves by exploring newly verified anomaly records and continuously updating the anomaly detector online.
Our detectors can achieve 88.94% in sensitivity and 94.60% on average, which makes them suitable for real-world deployment.
arXiv Detail & Related papers (2021-11-16T05:13:38Z)
- Edge-Cloud Polarization and Collaboration: A Comprehensive Survey [61.05059817550049]
We conduct a systematic review for both cloud and edge AI.
We are the first to set up the collaborative learning mechanism for cloud and edge modeling.
We discuss potentials and practical experiences of some on-going advanced edge AI topics.
arXiv Detail & Related papers (2021-11-11T05:58:23Z)
- Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z)
- Device-Cloud Collaborative Learning for Recommendation [50.01289274123047]
We propose a novel MetaPatch learning approach on the device side to efficiently achieve "thousands of people with thousands of models" given a centralized cloud model.
With billions of updated personalized device models, we propose a "model-over-models" distillation algorithm, namely MoMoDistill, to update the centralized cloud model.
arXiv Detail & Related papers (2021-04-14T05:06:59Z)
- Towards Deep Federated Defenses Against Malware in Cloud Ecosystems [0.24366811507669117]
In cloud computing environments with many virtual machines, containers, and other systems, an epidemic of malware can be highly threatening to business processes.
We introduce a hierarchical approach to performing malware detection and analysis using several recent advances in machine learning on graphs, hypergraphs, and natural language.
arXiv Detail & Related papers (2019-12-27T23:46:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.