Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
- URL: http://arxiv.org/abs/2403.07927v1
- Date: Thu, 29 Feb 2024 19:40:32 GMT
- Title: Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
- Authors: Pooja Srinivas, Fiza Husain, Anjaly Parayil, Ayush Choure, Chetan Bansal, Saravan Rajmohan
- Abstract summary: Gaps in monitoring can lead to delays in incident detection and significant negative customer impact.
Developers create monitors using their tribal knowledge and, primarily, a trial-and-error process.
We propose an intelligent monitoring framework that recommends monitors for cloud services based on their service properties.
- Score: 8.862212993027658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can lead to delays in incident detection and significant negative customer impact. The current process of monitor creation is ad hoc and reactive in nature. Developers create monitors using their tribal knowledge and, primarily, a trial-and-error process. As a result, monitors often have incomplete coverage, which leads to production issues, or redundancy, which results in noise and wasted effort. In this work, we address this issue by proposing an intelligent monitoring framework that recommends monitors for cloud services based on their service properties. We start by mining the attributes of 30,000+ monitors from 791 production services at Microsoft and derive a structured ontology for monitors. We focus on two crucial dimensions: what to monitor (resources) and which metrics to monitor. We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology. Using these insights, we propose a deep-learning-based framework that recommends monitors based on the service properties. Finally, we conduct a user study with engineers from Microsoft, which demonstrates the usefulness of the proposed framework. The proposed framework, along with the ontology-driven projections, succeeded in creating production-quality recommendations for the majority of resource classes. This was also validated by the users from the study, who rated the framework's usefulness as 4.27 out of 5.
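The abstract describes a deep-learning model that maps service properties to recommended monitors. The sketch below is a minimal, hypothetical rendering of that idea as a multi-label classifier; the feature encoding, monitor classes, and architecture are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch only: the paper's ontology, features, and model are not
# reproduced here. It shows the general shape of "service properties in,
# multi-label monitor recommendations out".
import torch
import torch.nn as nn

# Assumed monitor classes; the real ontology covers resources and their metrics.
MONITOR_CLASSES = ["cpu", "memory", "disk_io", "http_latency", "error_rate"]

class MonitorRecommender(nn.Module):
    def __init__(self, num_service_features: int, num_classes: int = len(MONITOR_CLASSES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_service_features, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, service_features: torch.Tensor) -> torch.Tensor:
        # Independent per-class probabilities: one service may need several monitors.
        return torch.sigmoid(self.net(service_features))

model = MonitorRecommender(num_service_features=16)
features = torch.randn(1, 16)  # placeholder encoding of a service's properties
scores = model(features)
recommended = [c for c, s in zip(MONITOR_CLASSES, scores[0].tolist()) if s > 0.5]
print(recommended)  # classes whose score exceeds a chosen threshold
```

Such a model would plausibly be trained with a binary cross-entropy loss against the monitors observed on existing production services.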
Related papers
- CloudHeatMap: Heatmap-Based Monitoring for Large-Scale Cloud Systems [1.1199585259018456]
This paper presents CloudHeatMap, a novel heatmap-based visualization tool for near-real-time monitoring of LCS health.
It offers intuitive visualizations of key metrics such as call volumes, response times, and HTTP response codes, enabling operators to quickly identify performance issues.
arXiv Detail & Related papers (2024-10-28T14:57:10Z)
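As a loose illustration of the heatmap view described in the CloudHeatMap entry above, the sketch below renders a made-up services-by-time matrix of call volumes; the tool's actual data pipeline and UI are not shown.

```python
# Illustrative only: a per-service, per-minute call-volume heatmap with made-up data.
import numpy as np
import matplotlib.pyplot as plt

services = ["auth", "billing", "frontend"]
minutes = 60
call_volume = np.random.poisson(lam=200, size=(len(services), minutes))

fig, ax = plt.subplots()
im = ax.imshow(call_volume, aspect="auto", cmap="viridis")
ax.set_yticks(range(len(services)))
ax.set_yticklabels(services)
ax.set_xlabel("minute")
ax.set_title("Call volume per service (illustrative)")
fig.colorbar(im, ax=ax, label="requests/min")
plt.show()
```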
- Monitoring Auditable Claims in the Cloud [0.0]
We propose a flexible monitoring approach that is independent of the implementation of the observed system.
Our approach is based on combining distributed Datalog-based programs with tamper-proof storage based on Trillian.
We apply our approach to an industrial use case that uses a cloud infrastructure for orchestrating unmanned air vehicles.
arXiv Detail & Related papers (2023-12-19T11:21:18Z)
- Optimal Event Monitoring through Internet Mashup over Multivariate Time Series [77.34726150561087]
This framework supports model definition, querying, parameter learning, model evaluation, data monitoring, decision recommendation, and web portal services.
We further extend the MTSA data model and query language to support this class of problems for the services of learning, monitoring, and recommendation.
arXiv Detail & Related papers (2022-10-18T16:56:17Z)
- LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts.
Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect.
Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
arXiv Detail & Related papers (2021-11-02T15:16:08Z)
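The LogLAB entry above hinges on labeling logs in retrospect from estimated failure time windows. The sketch below shows only that windowing step with made-up timestamps and window size; the paper's attention-based model is not reproduced.

```python
# Illustrative weak labeling: messages inside an estimated failure window get label 1.
from datetime import datetime, timedelta

def label_logs(logs, failure_start, window=timedelta(minutes=10)):
    """logs: list of (timestamp, message); returns (timestamp, message, label)."""
    failure_end = failure_start + window
    return [(ts, msg, int(failure_start <= ts <= failure_end)) for ts, msg in logs]

logs = [
    (datetime(2021, 11, 2, 15, 0), "request served"),
    (datetime(2021, 11, 2, 15, 18), "connection reset by peer"),
]
print(label_logs(logs, failure_start=datetime(2021, 11, 2, 15, 15)))
```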
- Benchmarking Safety Monitors for Image Classifiers with Machine Learning [0.0]
Highly accurate machine learning (ML) image classifiers cannot guarantee that they will not fail in operation.
The use of fault tolerance mechanisms such as safety monitors is a promising direction to keep the system in a safe state.
This paper aims at establishing a baseline framework for benchmarking monitors for ML image classifiers.
arXiv Detail & Related papers (2021-10-04T07:52:23Z)
- Scalable Perception-Action-Communication Loops with Convolutional and Graph Neural Networks [208.15591625749272]
We present a perception-action-communication loop design using Vision-based Graph Aggregation and Inference (VGAI).
Our framework is implemented by a cascade of a convolutional and a graph neural network (CNN / GNN), addressing agent-level visual perception and feature learning.
We demonstrate that VGAI yields performance comparable to or better than other decentralized controllers.
arXiv Detail & Related papers (2021-06-24T23:57:21Z)
- Customizable Reference Runtime Monitoring of Neural Networks using Resolution Boxes [0.0]
We present an approach for monitoring classification systems via data abstraction.
Box-based abstraction consists of representing a set of values by its minimum and maximum values in each dimension.
We augment boxes with a notion of resolution and define their clustering coverage, a quantitative metric that intuitively indicates the quality of the abstraction.
arXiv Detail & Related papers (2021-04-25T21:58:02Z)
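A minimal sketch of the box abstraction described in the entry above: a set of reference vectors is abstracted by per-dimension bounds, and a runtime monitor checks whether a new vector falls inside the box. The resolution and clustering-coverage machinery of the paper is omitted, and the data is made up.

```python
# Illustrative box abstraction: per-dimension [min, max] bounds plus a membership test.
import numpy as np

def fit_box(vectors: np.ndarray):
    """vectors: (n_samples, n_dims) -> (lower, upper) per-dimension bounds."""
    return vectors.min(axis=0), vectors.max(axis=0)

def in_box(x: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> bool:
    return bool(np.all(x >= lower) and np.all(x <= upper))

reference = np.array([[0.1, 2.0], [0.3, 1.5], [0.2, 1.8]])  # vectors observed for one class
lower, upper = fit_box(reference)
print(in_box(np.array([0.25, 1.7]), lower, upper))  # True: consistent with the abstraction
print(in_box(np.array([0.90, 1.7]), lower, upper))  # False: flagged by the monitor
```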
- Information Obfuscation of Graph Neural Networks [96.8421624921384]
We study the problem of protecting sensitive attributes by information obfuscation when learning with graph structured data.
We propose a framework to locally filter out pre-determined sensitive attributes via adversarial training with the total variation and the Wasserstein distance.
arXiv Detail & Related papers (2020-09-28T17:55:04Z)
- Monitoring and explainability of models in production [58.720142291102135]
Monitoring deployed models is crucial for the continued provision of high-quality, machine-learning-enabled services.
We discuss the challenges to successful implementation of solutions in each of these areas, with some recent examples of production-ready solutions using open-source tools.
arXiv Detail & Related papers (2020-07-13T10:37:05Z)
- A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service.
It preserves sensitive user data while providing cloud-based machine learning and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)
- Collaborative Inference for Efficient Remote Monitoring [34.27630312942825]
A naive approach to resolve this on the model level is to use simpler architectures.
We propose an alternative solution that decomposes the predictive model as a sum in which a simple function serves as a local monitoring tool.
A sign requirement is imposed on the remaining component to ensure that the local monitoring function is safe.
arXiv Detail & Related papers (2020-02-12T01:57:17Z)
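For the last entry, a minimal sketch of the additive decomposition under an assumed reading: the full prediction is the cheap local score plus a remaining component constrained to be non-negative, so the local score never overestimates the full one. The function forms and the softplus clamp are illustrative, not the paper's construction.

```python
# Hypothetical additive decomposition f(x) = g_local(x) + h_remote(x), where
# h_remote is constrained to be non-negative, so g_local(x) is a safe lower
# bound that can be used for monitoring before the remote reply arrives.
import numpy as np

def g_local(x: np.ndarray) -> float:
    # Simple, cheap local model (illustrative linear scorer).
    return float(x @ np.array([0.5, -0.2, 0.1]))

def h_remote(x: np.ndarray) -> float:
    # Stand-in for an expensive remote model; softplus keeps it non-negative.
    raw = float(np.sin(x).sum())
    return float(np.log1p(np.exp(raw)))

x = np.array([1.0, 0.3, -0.5])
local_score = g_local(x)                # available immediately on the device
full_score = local_score + h_remote(x)  # refined once the remote part is evaluated
assert full_score >= local_score        # the sign requirement keeps the local score safe
print(local_score, full_score)
```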
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.