Sinan: Data-Driven, QoS-Aware Cluster Management for Microservices
- URL: http://arxiv.org/abs/2105.13424v1
- Date: Thu, 27 May 2021 19:57:51 GMT
- Title: Sinan: Data-Driven, QoS-Aware Cluster Management for Microservices
- Authors: Yanqi Zhang, Weizhe Hua, Zhuangzhuang Zhou, Edward Suh, Christina
Delimitrou
- Abstract summary: Sinan is a data-driven cluster manager for interactive cloud microservices that is online and QoS-aware.
- Score: 3.6923632650826477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cloud applications are increasingly shifting from large monolithic services,
to large numbers of loosely-coupled, specialized microservices. Despite their
advantages in terms of facilitating development, deployment, modularity, and
isolation, microservices complicate resource management, as dependencies
between them introduce backpressure effects and cascading QoS violations.
We present Sinan, a data-driven cluster manager for interactive cloud
microservices that is online and QoS-aware. Sinan leverages a set of scalable
and validated machine learning models to determine the performance impact of
dependencies between microservices, and allocate appropriate resources per tier
in a way that preserves the end-to-end tail latency target. We evaluate Sinan
both on dedicated local clusters and large-scale deployments on Google Compute
Engine (GCE) across representative end-to-end applications built with
microservices, such as social networks and hotel reservation sites. We show
that Sinan always meets QoS while also keeping cluster utilization high,
in contrast to prior work, which leads to unpredictable performance or
sacrifices resource efficiency. Furthermore, the techniques in Sinan are
explainable, meaning that cloud operators can glean insights from the ML
models on how to better deploy and design their applications to reduce
unpredictable performance.
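The allocation loop the abstract describes, predicting end-to-end tail latency for candidate per-tier allocations and choosing the cheapest one that preserves the QoS target, can be sketched roughly as follows. This is a minimal illustration, not Sinan's implementation: the tier names, the resource grid, and the simple analytic "predictor" are hypothetical stand-ins for Sinan's validated ML models.

```python
# Sketch of a QoS-aware per-tier allocator in the spirit of Sinan:
# a latency predictor scores candidate allocations, and the manager
# picks the cheapest allocation whose predicted end-to-end tail
# latency still meets the QoS target.
from itertools import product

QOS_TARGET_MS = 200.0                    # end-to-end tail latency target
TIERS = ["frontend", "logic", "cache"]   # hypothetical microservice tiers
CPU_CHOICES = [1, 2, 4]                  # cores per tier to consider

def predict_tail_latency_ms(alloc):
    """Stand-in latency model: each tier's contribution shrinks as it
    receives more cores. Sinan instead learns this mapping (including
    cross-tier dependency effects) from data."""
    base = {"frontend": 80.0, "logic": 160.0, "cache": 120.0}
    return sum(base[t] / alloc[t] for t in TIERS)

def cheapest_qos_allocation():
    """Exhaustively search the small resource grid for the minimum-core
    allocation whose predicted latency meets QoS."""
    best = None
    for cores in product(CPU_CHOICES, repeat=len(TIERS)):
        alloc = dict(zip(TIERS, cores))
        if predict_tail_latency_ms(alloc) <= QOS_TARGET_MS:
            cost = sum(cores)
            if best is None or cost < best[0]:
                best = (cost, alloc)
    return best[1] if best else None

alloc = cheapest_qos_allocation()
print(alloc, predict_tail_latency_ms(alloc))  # 2 cores per tier meets QoS
```

Exhaustive search is only viable for this toy grid; the point of learning a fast, explainable predictor is precisely to prune this space online as load changes.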
Related papers
- DeepScaler: Holistic Autoscaling for Microservices Based on
Spatiotemporal GNN with Adaptive Graph Learning [4.128665560397244]
This paper presents DeepScaler, a deep learning-based holistic autoscaling approach.
It focuses on coping with service dependencies to optimize service-level agreements (SLA) assurance and cost efficiency.
Experimental results demonstrate that our method implements a more effective autoscaling mechanism for microservices.
arXiv Detail & Related papers (2023-09-02T08:22:21Z) - Alioth: A Machine Learning Based Interference-Aware Performance Monitor
for Multi-Tenancy Applications in Public Cloud [15.942285615596566]
Multi-tenancy in public clouds may lead to co-location interference on shared resources, which possibly results in performance degradation.
We propose a novel machine learning framework, Alioth, to monitor the performance degradation of cloud applications.
Alioth achieves an average mean absolute error of 5.29% offline and 10.8% when testing on applications unseen in the training stage.
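The reported figures are mean absolute errors between predicted and observed performance degradation. As a reminder of the metric (the sample values below are illustrative, not Alioth's data):

```python
# Mean absolute error as used by interference monitors such as Alioth:
# the average of |predicted - observed| degradation, in percentage points.
def mean_absolute_error(predicted, observed):
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

pred = [12.0, 30.5, 8.2, 21.0]   # predicted degradation (%), made-up numbers
obs  = [10.0, 33.0, 9.0, 18.5]   # measured degradation (%), made-up numbers
print(mean_absolute_error(pred, obs))  # -> 1.95
```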
arXiv Detail & Related papers (2023-07-18T03:34:33Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Predicting Resource Consumption of Kubernetes Container Systems using
Resource Models [3.138731415322007]
This paper considers how to derive resource models for cloud systems empirically.
We do so based on models of deployed services in a formal language with explicit models of CPU and memory resources.
We report on leveraging data collected empirically from small deployments to simulate the execution of higher intensity scenarios on larger deployments.
arXiv Detail & Related papers (2023-05-12T17:59:01Z) - Benchmarking scalability of stream processing frameworks deployed as
microservices in the cloud [0.38073142980732994]
We benchmark five modern stream processing frameworks regarding their scalability using a systematic method.
All benchmarked frameworks exhibit approximately linear scalability as long as sufficient cloud resources are provisioned.
There is no clear superior framework; rather, the ranking of the frameworks depends on the use case.
arXiv Detail & Related papers (2023-03-20T13:22:03Z) - Neural Attentive Circuits [93.95502541529115]
We introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs)
NACs learn the parameterization and sparse connectivity of neural modules without using domain knowledge.
NACs achieve an 8x speedup at inference time while losing less than 3% performance.
arXiv Detail & Related papers (2022-10-14T18:00:07Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z) - Exploring the potential of flow-based programming for machine learning
deployment in comparison with service-oriented architectures [8.677012233188968]
We argue that part of the reason is infrastructure that was not designed for activities around data collection and analysis.
We propose to consider flow-based programming with data streams as an alternative to commonly used service-oriented architectures for building software applications.
arXiv Detail & Related papers (2021-08-09T15:06:02Z) - Federated Learning with Unreliable Clients: Performance Analysis and
Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z) - A Privacy-Preserving Distributed Architecture for
Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service.
It is able to preserve the user sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.