Serverless Model Serving for Data Science
- URL: http://arxiv.org/abs/2103.02958v1
- Date: Thu, 4 Mar 2021 11:23:01 GMT
- Title: Serverless Model Serving for Data Science
- Authors: Yuncheng Wu, Tien Tuan Anh Dinh, Guoyu Hu, Meihui Zhang, Yeow Meng
Chee, Beng Chin Ooi
- Abstract summary: We study the viability of serverless as a mainstream model serving platform for data science applications.
We find that serverless outperforms many cloud-based alternatives with respect to cost and performance.
We present several practical recommendations for data scientists on how to use serverless for scalable and cost-effective model serving.
- Score: 23.05534539170047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) is an important part of modern data science
applications. Data scientists today have to manage the end-to-end ML life cycle
that includes both model training and model serving, the latter of which is
essential, as it makes their work available to end-users. Systems for model
serving require high performance, low cost, and ease of management. Cloud
providers are already offering model serving options, including managed
services and self-rented servers. Recently, serverless computing, whose
advantages include high elasticity and a fine-grained cost model, brings another
possibility for model serving.
In this paper, we study the viability of serverless as a mainstream model
serving platform for data science applications. We conduct a comprehensive
evaluation of the performance and cost of serverless against other model
serving systems on two clouds: Amazon Web Services (AWS) and Google Cloud
Platform (GCP). We find that serverless outperforms many cloud-based
alternatives with respect to cost and performance. More interestingly, under
some circumstances, it can even outperform GPU-based systems for both average
latency and cost. These results differ from previous works' claims that
serverless is not suitable for model serving, and are contrary to the
conventional wisdom that GPU-based systems are better for ML workloads than
CPU-based systems. Other findings include a large gap in cold start time
between AWS and GCP serverless functions, and serverless' low sensitivity to
changes in workloads or models. Our evaluation results indicate that serverless
is a viable option for model serving. Finally, we present several practical
recommendations for data scientists on how to use serverless for scalable and
cost-effective model serving.
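As a rough illustration of the serving pattern the paper evaluates, the following is a minimal sketch of a CPU-based serverless inference function written for AWS Lambda in Python. The MODEL_BUCKET/MODEL_KEY environment variables, the request payload format, and the scikit-learn-style predict() interface are assumptions made for this example, not details taken from the paper.

```python
# Minimal sketch of serverless model serving on AWS Lambda (illustrative only).
# Assumes a pickled scikit-learn-style model stored in S3 and a JSON payload
# of the form {"instances": [[...feature vector...], ...]}.
import json
import os
import pickle

import boto3

s3 = boto3.client("s3")
_model = None  # cached at module scope, reused across warm invocations


def _load_model():
    """Download and unpickle the model once per container (the cold-start cost)."""
    global _model
    if _model is None:
        bucket = os.environ["MODEL_BUCKET"]              # assumed env var
        key = os.environ.get("MODEL_KEY", "model.pkl")   # assumed env var
        obj = s3.get_object(Bucket=bucket, Key=key)
        _model = pickle.loads(obj["Body"].read())
    return _model


def handler(event, context):
    """Lambda entry point: returns predictions for the submitted instances."""
    model = _load_model()
    payload = json.loads(event["body"]) if "body" in event else event
    predictions = model.predict(payload["instances"]).tolist()
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": predictions}),
    }
```

Because the model is loaded once per container and cached at module scope, the cold start time highlighted in the abstract (and the large AWS-versus-GCP gap in it) dominates tail latency, while the per-request charge follows the fine-grained billing model: memory allocation multiplied by execution time, plus a per-invocation fee.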
Related papers
- FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge [2.1119495676190128]
We introduce FusedInf to efficiently swap DNN models for on-demand serverless inference services on the edge.
Our evaluation of popular DNN models showed that creating a single DAG can make the execution of the models up to 14% faster.
arXiv Detail & Related papers (2024-10-28T15:21:23Z)
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose SeBS-Flow, the first serverless workflow benchmarking suite.
SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns.
We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language model serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4-9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% of the monetary cost compared with using only on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- DualCF: Efficient Model Extraction Attack from Counterfactual Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms that allow users to access large-scale cloud-based models via APIs.
The extra information exposed by counterfactual explanations inevitably makes the cloud models more vulnerable to extraction attacks.
We propose a novel, simple yet efficient querying strategy that greatly improves the efficiency of stealing a classification model.
arXiv Detail & Related papers (2022-05-13T08:24:43Z)
- Performance Modeling of Metric-Based Serverless Computing Platforms [5.089110111757978]
The proposed performance model can help developers and providers predict the performance and cost of deployments with different configurations.
We validate the applicability and accuracy of the proposed performance model by extensive real-world experimentation on Knative.
arXiv Detail & Related papers (2022-02-23T00:39:01Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
These basic engines alone, however, do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Serverless inferencing on Kubernetes [0.0]
We will discuss the KFServing project, which builds on the Knative serverless paradigm to provide a serverless machine learning inference solution.
We will show how it solves the challenges of autoscaling GPU-based inference and discuss some of the lessons learnt from using it in production.
arXiv Detail & Related papers (2020-07-14T21:23:59Z)
- Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction [58.98112070128482]
We propose a lightweight solution for series prediction based on historic observations.
It consists of a heterogeneous ensemble method composed of two models - a neural network and a mean predictor.
It achieves an overall $R^2$ score of 0.10 on the available FedCSIS 2020 challenge dataset.
arXiv Detail & Related papers (2020-07-07T15:44:16Z)
- MLModelCI: An Automatic Cloud Platform for Efficient MLaaS [15.029094196394862]
We release the platform as an open-source project on GitHub under Apache 2.0 license.
Our system bridges the gap between current ML training and serving systems and thus frees developers from the manual and tedious work often associated with service deployment.
arXiv Detail & Related papers (2020-06-09T07:48:20Z)
- Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.