Related papers: MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms

MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms

URL: http://arxiv.org/abs/2202.11243v1
Date: Wed, 23 Feb 2022 00:27:49 GMT
Title: MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms
Authors: Nima Mahmoudi, Hamzeh Khazaei
Abstract summary: Serving machine learning inference workloads on the cloud is still a challenging task on the production level. Serverless computing has emerged in recent years to automate most infrastructure management tasks. We present ML Proxy, a reverse proxy to support efficient machine learning serving workloads on serverless computing systems.
Score: 5.089110111757978
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Serving machine learning inference workloads on the cloud is still a challenging task on the production level. Optimal configuration of the inference workload to meet SLA requirements while optimizing the infrastructure costs is highly complicated due to the complex interaction between batch configuration, resource configurations, and variable arrival process. Serverless computing has emerged in recent years to automate most infrastructure management tasks. Workload batching has revealed the potential to improve the response time and cost-effectiveness of machine learning serving workloads. However, it has not yet been supported out of the box by serverless computing platforms. Our experiments have shown that for various machine learning workloads, batching can hugely improve the system's efficiency by reducing the processing overhead per request. In this work, we present MLProxy, an adaptive reverse proxy to support efficient machine learning serving workloads on serverless computing systems. MLProxy supports adaptive batching to ensure SLA compliance while optimizing serverless costs. We performed rigorous experiments on Knative to demonstrate the effectiveness of MLProxy. We showed that MLProxy could reduce the cost of serverless deployment by up to 92% while reducing SLA violations by up to 99% that can be generalized across state-of-the-art model serving frameworks.

Related papers

FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication [2.1301190271783317]
We present FSD-Inference, the first fully serverless and highly scalable system for distributed ML inference. We introduce novel fully serverless communication schemes for ML inference workloads, leveraging both cloud-based publish-subscribe/queueing and object storage offerings.
arXiv Detail & Related papers (2024-03-22T13:31:24Z)
SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language models serving system on preemptible instances. We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
Exploring the Impact of Serverless Computing on Peer To Peer Training Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training. Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods. Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z)
Serverless Federated AUPRC Optimization for Multi-Party Collaborative Imbalanced Data Mining [119.89373423433804]
Area Under Precision-Recall (AUPRC) was introduced as an effective metric. Serverless multi-party collaborative training can cut down the communications cost by avoiding the server node bottleneck. We propose a new ServerLess biAsed sTochastic gradiEnt (SLATE) algorithm to directly optimize the AUPRC.
arXiv Detail & Related papers (2023-08-06T06:51:32Z)
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
Architecting Peer-to-Peer Serverless Distributed Machine Learning Training for Improved Fault Tolerance [1.495380389108477]
Serverless computing is a new paradigm for cloud computing that uses functions as a computational unit. By distributing the workload, distributed machine learning can speed up the training process and allow more complex models to be trained. We propose exploring the use of serverless computing in distributed machine learning training and comparing the performance of P2P architecture with the parameter server architecture.
arXiv Detail & Related papers (2023-02-27T17:38:47Z)
Multi-Agent Automated Machine Learning [54.14038920246645]
We propose multi-agent automated machine learning (MA2ML) to handle joint optimization of modules in automated machine learning (AutoML) MA2ML explicitly assigns credit to each agent according to its marginal contribution to enhance cooperation among modules, and incorporates off-policy learning to improve search efficiency. Experiments show that MA2ML yields the state-of-the-art top-1 accuracy on ImageNet under constraints of computational cost.
arXiv Detail & Related papers (2022-10-17T13:32:59Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Towards Demystifying Serverless Machine Learning Training [19.061432528378788]
We present a systematic, comparative study of distributed machine learning training over serverless infrastructures. We develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for serverless infrastructure.
arXiv Detail & Related papers (2021-05-17T13:19:23Z)
AI-based Resource Allocation: Reinforcement Learning for Adaptive Auto-scaling in Serverless Environments [0.0]
Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. A common approach among both commercial and open source serverless computing platforms is workload-based auto-scaling. In this paper we investigate the applicability of a reinforcement learning approach to request-based auto-scaling in a serverless framework.
arXiv Detail & Related papers (2020-05-29T06:18:39Z)
Dynamic Parameter Allocation in Parameter Servers [74.250687861348]
We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse. We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
arXiv Detail & Related papers (2020-02-03T11:37:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.