Serverless Model Serving for Data Science
- URL: http://arxiv.org/abs/2103.02958v1
- Date: Thu, 4 Mar 2021 11:23:01 GMT
- Title: Serverless Model Serving for Data Science
- Authors: Yuncheng Wu, Tien Tuan Anh Dinh, Guoyu Hu, Meihui Zhang, Yeow Meng
Chee, Beng Chin Ooi
- Abstract summary: We study the viability of serverless as a mainstream model serving platform for data science applications.
We find that serverless outperforms many cloud-based alternatives with respect to cost and performance.
We present several practical recommendations for data scientists on how to use serverless for scalable and cost-effective model serving.
- Score: 23.05534539170047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) is an important part of modern data science
applications. Data scientists today have to manage the end-to-end ML life cycle
that includes both model training and model serving, the latter of which is
essential, as it makes their work available to end-users. Systems for model
serving require high performance, low cost, and ease of management. Cloud
providers are already offering model serving options, including managed
services and self-rented servers. Recently, serverless computing, whose
advantages include high elasticity and a fine-grained cost model, brings another
possibility for model serving.
In this paper, we study the viability of serverless as a mainstream model
serving platform for data science applications. We conduct a comprehensive
evaluation of the performance and cost of serverless against other model
serving systems on two clouds: Amazon Web Services (AWS) and Google Cloud
Platform (GCP). We find that serverless outperforms many cloud-based
alternatives with respect to cost and performance. More interestingly, under
some circumstances, it can even outperform GPU-based systems for both average
latency and cost. These results differ from previous works' claims that
serverless is not suitable for model serving, and are contrary to the
conventional wisdom that GPU-based systems are better for ML workloads than
CPU-based systems. Other findings include a large gap in cold start time
between AWS and GCP serverless functions, and serverless' low sensitivity to
changes in workloads or models. Our evaluation results indicate that serverless
is a viable option for model serving. Finally, we present several practical
recommendations for data scientists on how to use serverless for scalable and
cost-effective model serving.
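As a rough illustration of the serving pattern the paper evaluates, the following is a minimal sketch of a CPU-based serverless inference function written for AWS Lambda in Python. The MODEL_BUCKET/MODEL_KEY environment variables, the request payload format, and the scikit-learn-style predict() interface are assumptions made for this example, not details taken from the paper.

```python
# Minimal sketch of serverless model serving on AWS Lambda (illustrative only).
# Assumes a pickled scikit-learn-style model stored in S3 and a JSON payload
# of the form {"instances": [[...feature vector...], ...]}.
import json
import os
import pickle

import boto3

s3 = boto3.client("s3")
_model = None  # cached at module scope, reused across warm invocations


def _load_model():
    """Download and unpickle the model once per container (the cold-start cost)."""
    global _model
    if _model is None:
        bucket = os.environ["MODEL_BUCKET"]              # assumed env var
        key = os.environ.get("MODEL_KEY", "model.pkl")   # assumed env var
        obj = s3.get_object(Bucket=bucket, Key=key)
        _model = pickle.loads(obj["Body"].read())
    return _model


def handler(event, context):
    """Lambda entry point: returns predictions for the submitted instances."""
    model = _load_model()
    payload = json.loads(event["body"]) if "body" in event else event
    predictions = model.predict(payload["instances"]).tolist()
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": predictions}),
    }
```

Because the model is loaded once per container and cached at module scope, the cold start time highlighted in the abstract (and the large AWS-versus-GCP gap in it) dominates tail latency, while the per-request charge follows the fine-grained billing model: memory allocation multiplied by execution time, plus a per-invocation fee.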
Related papers
- FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge [2.1119495676190128]
We introduce FusedInf to efficiently swap DNN models for on-demand serverless inference services on the edge.
Our evaluation of popular DNN models showed that creating a single DAG can make the execution of the models up to 14% faster.
arXiv Detail & Related papers (2024-10-28T15:21:23Z)
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose SeBS-Flow, the first serverless workflow benchmarking suite.
SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns.
We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language model serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4-9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% of the monetary cost compared with using only on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- DualCF: Efficient Model Extraction Attack from Counterfactual Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms that allow users to access large-scale cloud-based models via APIs.
The extra information exposed by counterfactual explanations inevitably makes the cloud models more vulnerable to extraction attacks.
We propose a novel, simple yet efficient querying strategy that greatly improves the efficiency of stealing a classification model.
arXiv Detail & Related papers (2022-05-13T08:24:43Z)
- Performance Modeling of Metric-Based Serverless Computing Platforms [5.089110111757978]
The proposed performance model can help developers and providers predict the performance and cost of deployments with different configurations.
We validate the applicability and accuracy of the proposed performance model by extensive real-world experimentation on Knative.
arXiv Detail & Related papers (2022-02-23T00:39:01Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
These basic engines alone, however, do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Serverless inferencing on Kubernetes [0.0]
We will discuss the KFServing project, which builds on the Knative serverless paradigm to provide a serverless machine learning inference solution.
We will show how it solves the challenges of autoscaling GPU-based inference and discuss some of the lessons learnt from using it in production.
arXiv Detail & Related papers (2020-07-14T21:23:59Z)
- Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction [58.98112070128482]
We propose a lightweight solution for series prediction based on historic observations.
It consists of a heterogeneous ensemble method composed of two models - a neural network and a mean predictor.
It achieves an overall $R^2$ score of 0.10 on the available FedCSIS 2020 challenge dataset.
arXiv Detail & Related papers (2020-07-07T15:44:16Z)
- MLModelCI: An Automatic Cloud Platform for Efficient MLaaS [15.029094196394862]
We release the platform as an open-source project on GitHub under Apache 2.0 license.
Our system bridges the gap between current ML training and serving systems and thus frees developers from the manual and tedious work often associated with service deployment.
arXiv Detail & Related papers (2020-06-09T07:48:20Z)
- Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.