Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in
Public Cloud
- URL: http://arxiv.org/abs/2106.05345v1
- Date: Wed, 9 Jun 2021 19:23:58 GMT
- Title: Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in
Public Cloud
- Authors: Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran,
Mahmut Taylan Kandemir, Chita R. Das
- Abstract summary: We propose Cocktail, a cost-effective ensembling-based model serving framework.
A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x.
- Score: 9.149566952446058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With a growing demand for adopting ML models for a variety of application services, it is vital that the frameworks serving these models are capable of delivering highly accurate predictions with minimal latency along with reduced deployment costs in a public cloud environment. Prior works in this domain, despite incurring high latency, are crucially limited by the accuracy offered by individual models. Intuitively, model ensembling can address the accuracy gap by intelligently combining different models in parallel. However, selecting the appropriate models dynamically at runtime to meet the desired accuracy with low latency at minimal deployment cost is a nontrivial problem. Towards this, we propose Cocktail, a cost-effective ensembling-based model serving framework. Cocktail comprises two key components: (i) a dynamic model selection framework, which reduces the number of models in the ensemble while satisfying the accuracy and latency requirements; (ii) an adaptive resource management (RM) framework that employs a distributed proactive autoscaling policy combined with importance sampling to efficiently allocate resources for the models. The RM framework leverages transient virtual machine (VM) instances to reduce the deployment cost in a public cloud. A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x, while providing a 2x reduction in latency and satisfying the target accuracy for up to 96% of the requests, when compared to state-of-the-art model-serving frameworks.
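To make the dynamic model-selection idea concrete, here is a minimal sketch of how such a component might pick the smallest ensemble that meets an accuracy target under a latency SLO. The greedy heuristic, the independence assumption in the accuracy estimate, and all names are illustrative assumptions, not Cocktail's actual algorithm or API.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    accuracy: float    # validation accuracy in [0, 1]
    latency_ms: float  # measured inference latency

def select_ensemble(profiles, target_accuracy, latency_slo_ms):
    """Greedily grow the smallest ensemble whose estimated accuracy meets
    the target. Members run in parallel, so ensemble latency is the max
    over members; any model violating the SLO is excluded up front."""
    candidates = sorted(
        (p for p in profiles if p.latency_ms <= latency_slo_ms),
        key=lambda p: p.accuracy,
        reverse=True,
    )
    ensemble = []
    for p in candidates:
        ensemble.append(p)
        # Crude estimate assuming independent errors: the ensemble is wrong
        # only if every member is wrong. A real system would calibrate this.
        err = 1.0
        for m in ensemble:
            err *= 1.0 - m.accuracy
        if 1.0 - err >= target_accuracy:
            break
    return ensemble

zoo = [
    ModelProfile("resnet18", 0.70, 12.0),
    ModelProfile("resnet50", 0.76, 25.0),
    ModelProfile("inception_v4", 0.80, 48.0),
]
picked = select_ensemble(zoo, target_accuracy=0.95, latency_slo_ms=50.0)
print([m.name for m in picked])  # ['inception_v4', 'resnet50']
```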
Related papers
- Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution [1.8029479474051309]
We design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary.
Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model's output is uncertain.
Experimental results on the action classification task show that our framework not only incurs lower computational overhead, but also improves accuracy compared to using a large model alone.
arXiv Detail & Related papers (2024-10-16T02:06:27Z)
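The switcher-based deferral described in the entry above might reduce to a routing step like the following; `edge_model`, `cloud_model`, and `switcher` are hypothetical stand-ins, and DMD's actual training procedure and interfaces are not shown.

```python
def classify(frame, edge_model, cloud_model, switcher, threshold=0.5):
    """Answer locally when the switcher predicts the edge model is reliable;
    otherwise defer the input to the larger cloud model."""
    edge_scores = edge_model(frame)
    # The switcher estimates the probability that the edge output is wrong.
    if switcher(frame, edge_scores) < threshold:
        return max(range(len(edge_scores)), key=edge_scores.__getitem__)
    cloud_scores = cloud_model(frame)
    return max(range(len(cloud_scores)), key=cloud_scores.__getitem__)

# Toy usage with placeholder callables (three classes).
edge = lambda x: [0.2, 0.5, 0.3]
cloud = lambda x: [0.1, 0.1, 0.8]
uncertain = lambda x, scores: 1.0 - max(scores)  # high when edge is unsure
print(classify("frame.jpg", edge, cloud, uncertain))  # defers to cloud -> 2
```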
- Towards Robust and Efficient Cloud-Edge Elastic Model Adaptation via Selective Entropy Distillation [56.79064699832383]
We establish a Cloud-Edge Elastic Model Adaptation (CEMA) paradigm in which the edge models only need to perform forward propagation.
In our CEMA, to reduce the communication burden, we devise two criteria to exclude unnecessary samples from being uploaded to the cloud.
arXiv Detail & Related papers (2024-02-27T08:47:19Z)
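One plausible reading of CEMA's two exclusion criteria is an entropy band: skip uploading samples the edge model is already confident about (little to learn) and samples whose predictions are so uncertain they are likely unreliable. The thresholds and this interpretation are assumptions suggested by "Selective Entropy Distillation", not the paper's exact rules.

```python
import math

def entropy(probs):
    """Shannon entropy of a predictive distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_upload(probs, low=0.2, high=1.0):
    """Upload only samples inside the entropy band: confident predictions
    (entropy <= low) carry little signal, and highly uncertain ones
    (entropy >= high) are likely unreliable for adaptation."""
    return low < entropy(probs) < high

print(should_upload([0.98, 0.01, 0.01]))  # False: edge already confident
print(should_upload([0.70, 0.20, 0.10]))  # True: informative, still reliable
print(should_upload([0.34, 0.33, 0.33]))  # False: too uncertain
```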
- ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation [47.35179593006409]
We propose ECLM, an edge-cloud collaborative learning framework for rapid model adaptation in dynamic edge environments.
We show that ECLM significantly improves model performance (e.g., 18.89% accuracy increase) and resource efficiency (e.g., 7.12x communication cost reduction) when adapting models to dynamic edge environments.
arXiv Detail & Related papers (2023-11-18T14:10:09Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
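Combining the two ideas in the entry above, a serving layer might put an LRU cache in front of a multiplexer that routes each query to a model. This toy sketch assumes exact-match caching and a user-supplied router; neither is claimed to match the paper's learned policies.

```python
from collections import OrderedDict

class CachedMultiplexer:
    """LRU response cache in front of a model multiplexer. `models` maps a
    name to a callable; `router` maps a query to one of those names."""

    def __init__(self, models, router, capacity=1024):
        self.models = models
        self.router = router
        self.capacity = capacity
        self._cache = OrderedDict()

    def query(self, q):
        if q in self._cache:
            self._cache.move_to_end(q)        # refresh LRU position on a hit
            return self._cache[q]
        answer = self.models[self.router(q)](q)
        self._cache[q] = answer
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)   # evict least-recently-used entry
        return answer

# Toy usage: route short queries to a cheap model, long ones to a big model.
mux = CachedMultiplexer(
    models={"small": lambda q: q.upper(), "large": lambda q: q[::-1]},
    router=lambda q: "small" if len(q) < 16 else "large",
)
print(mux.query("hello"))  # computed by the small model
print(mux.query("hello"))  # served from the cache
```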
- Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training [1.047192732651018]
We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel-scaling concepts with new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations to within 5% error.
arXiv Detail & Related papers (2023-03-12T13:42:39Z)
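As a rough illustration of the kind of time/cost model the Scavenger entry describes, the sketch below combines a serial term, a perfectly parallel term, and a communication term per training step. The functional form and coefficients are assumptions (in practice they would be fit from profiling runs), and the paper's SGD-noise modeling is not captured here.

```python
def estimate_time_cost(n_workers, steps=10_000, t_serial=0.005,
                       t_parallel=0.8, t_comm=0.002, price_per_worker_hr=0.90):
    """Estimate wall-clock hours and dollar cost for a training run.
    Per-step time = serial part + parallel part / n + comm overhead * n."""
    step_time = t_serial + t_parallel / n_workers + t_comm * n_workers
    hours = steps * step_time / 3600.0
    return hours, hours * n_workers * price_per_worker_hr

# Sweep cluster sizes to expose the time/cost trade-off.
for n in (1, 2, 4, 8, 16):
    hours, cost = estimate_time_cost(n)
    print(f"{n:2d} workers: {hours:6.2f} h, ${cost:7.2f}")
```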
- Complement Sparsification: Low-Overhead Model Pruning for Federated Learning [2.0428960719376166]
Federated Learning (FL) is a privacy-preserving distributed deep learning paradigm that involves substantial communication and computation effort.
Existing model pruning/sparsification solutions cannot satisfy the requirements for low bidirectional communication overhead between the server and the clients.
We propose Complement Sparsification (CS), a pruning mechanism that satisfies all these requirements through complementary and collaborative pruning performed at the server and the clients.
arXiv Detail & Related papers (2023-03-10T23:07:02Z)
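A minimal way to picture "complementary pruning at the server and the clients" is via disjoint masks over the same weight vector, as sketched below; the magnitude heuristic and all names are illustrative, not the CS mechanism itself.

```python
def server_mask(weights, keep_ratio=0.4):
    """Server keeps a sparse backbone of the largest-magnitude weights."""
    k = max(1, int(len(weights) * keep_ratio))
    top = sorted(range(len(weights)),
                 key=lambda i: abs(weights[i]), reverse=True)[:k]
    mask = [0] * len(weights)
    for i in top:
        mask[i] = 1
    return mask

def client_mask(server_side):
    """Clients train only the complementary positions, so the two sparse
    updates never overlap and each side ships only its own delta."""
    return [1 - m for m in server_side]

w = [0.9, -0.05, 0.4, 0.01, -0.7]
s = server_mask(w)                 # [1, 0, 0, 0, 1] for keep_ratio=0.4
c = client_mask(s)                 # [0, 1, 1, 1, 0]
assert all(a + b == 1 for a, b in zip(s, c))  # masks are strictly disjoint
```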
- MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$-$10\times$ faster and tune hyperparameters $20\times$-$75\times$ faster than full-dataset training or tuning, without loss of performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z)
- DualCF: Efficient Model Extraction Attack from Counterfactual Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms to allow users to access large-scale cloud-based models via APIs.
Such extra information (e.g., counterfactual explanations returned alongside predictions) inevitably makes the cloud models more vulnerable to extraction attacks.
We propose a novel simple yet efficient querying strategy to greatly enhance the querying efficiency to steal a classification model.
arXiv Detail & Related papers (2022-05-13T08:24:43Z)
- Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data.
In this work, we propose a generic coreset framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem.
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
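For reference, a cardinality-constrained bilevel coreset formulation of the kind the last entry describes is typically written along these lines (notation assumed, not copied from the paper):

```latex
\min_{S \subseteq V,\ |S| \le k} \; \mathcal{L}_{\mathrm{outer}}\bigl(\theta^{*}(S)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(S) \in \operatorname*{arg\,min}_{\theta} \; \mathcal{L}_{\mathrm{inner}}\bigl(\theta;\, S\bigr)
```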
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.