Towards Demystifying Serverless Machine Learning Training
- URL: http://arxiv.org/abs/2105.07806v1
- Date: Mon, 17 May 2021 13:19:23 GMT
- Title: Towards Demystifying Serverless Machine Learning Training
- Authors: Jiawei Jiang, Shaoduo Gan, Yue Liu, Fanlin Wang, Gustavo Alonso, Ana
Klimovic, Ankit Singla, Wentao Wu, Ce Zhang
- Abstract summary: We present a systematic, comparative study of distributed machine learning training over serverless infrastructures.
We develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for serverless infrastructure.
- Score: 19.061432528378788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The appeal of serverless (FaaS) has triggered a growing interest in how to
use it in data-intensive applications such as ETL, query processing, or machine
learning (ML). Several systems exist for training large-scale ML models on top
of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results
in terms of their performance and relative advantage over "serverful"
infrastructures (IaaS). In this paper we present a systematic, comparative
study of distributed ML training over FaaS and IaaS. We present a design space
covering design choices such as optimization algorithms and synchronization
protocols, and implement a platform, LambdaML, that enables a fair comparison
between FaaS and IaaS. We present experimental results using LambdaML, and
further develop an analytic model to capture cost/performance tradeoffs that
must be considered when opting for a serverless infrastructure. Our results
indicate that ML training pays off in serverless only for models with efficient
(i.e., reduced) communication that also converge quickly. In general, FaaS can
be much faster, but it is never significantly cheaper than IaaS.
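The analytic cost/performance model is not spelled out in the abstract; below is a minimal, hypothetical sketch of how such a FaaS-vs-IaaS comparison can be framed, assuming simplified GB-second pricing for serverless functions and per-instance-hour pricing for serverful instances. All prices, workload parameters, and function names are illustrative assumptions, not LambdaML's actual model.

```python
# Hypothetical sketch of a FaaS-vs-IaaS training cost comparison.
# Prices and workload parameters are illustrative assumptions only.

def faas_cost(num_workers, run_seconds, gb_per_worker,
              price_per_gb_second=0.0000167):
    """Serverless cost: pay per GB-second actually consumed by each function."""
    return num_workers * gb_per_worker * run_seconds * price_per_gb_second

def iaas_cost(num_instances, run_seconds, price_per_instance_hour=0.40):
    """Serverful cost: pay per instance for the full run duration."""
    return num_instances * (run_seconds / 3600.0) * price_per_instance_hour

if __name__ == "__main__":
    # Toy numbers: a communication-light model that converges in a few epochs.
    # Real tradeoffs depend on synchronization cost and convergence speed,
    # which is what the paper's analytic model is meant to capture.
    print(f"FaaS cost: ${faas_cost(num_workers=32, run_seconds=300, gb_per_worker=3):.2f}")
    print(f"IaaS cost: ${iaas_cost(num_instances=4, run_seconds=900):.2f}")
```

With these toy numbers the serverless run finishes faster but costs slightly more, loosely mirroring the paper's finding that FaaS can be much faster yet is rarely significantly cheaper.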
Related papers
- R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [83.77114091471822]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML).
A challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming.
This is particularly pronounced for word embedding parameters in large language models (LLMs), which are crucial for language understanding.
A physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z) - SpaFL: Communication-Efficient Federated Learning with Sparse Models and Low computational Overhead [75.87007729801304]
SpaFL, a communication-efficient FL framework, is proposed to optimize sparse model structures with low computational overhead.
Experiments show that SpaFL improves accuracy while requiring much less communication and computing resources compared to sparse baselines.
arXiv Detail & Related papers (2024-06-01T13:10:35Z) - FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication [2.1301190271783317]
We present FSD-Inference, the first fully serverless and highly scalable system for distributed ML inference.
We introduce novel fully serverless communication schemes for ML inference workloads, leveraging both cloud-based publish-subscribe/queueing and object storage offerings.
arXiv Detail & Related papers (2024-03-22T13:31:24Z) - EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism [70.07661254213181]
We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs).
Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting.
Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead.
arXiv Detail & Related papers (2023-12-08T09:31:50Z) - FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large
Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs, and introduces our package FS-LLM as a main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamics computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference
Pipeline [2.0813318162800707]
We propose ezDPS, a new efficient and zero-knowledge Machine Learning inference scheme.
ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy.
We show that ezDPS is one to three orders of magnitude more efficient than the generic circuit-based approach in all metrics.
arXiv Detail & Related papers (2022-12-11T06:47:28Z) - Cost Effective MLaaS Federation: A Combinatorial Reinforcement Learning
Approach [9.50492686145041]
Federating different ML services together allows us to further improve analytic performance.
However, naively aggregating results from different ML services not only incurs significant monetary cost but may also lead to sub-optimal performance gains.
We propose a framework, Armol, to unify the right selection of ML providers to achieve the best possible analytic performance.
arXiv Detail & Related papers (2022-04-29T09:44:04Z) - Evaluation and Optimization of Distributed Machine Learning Techniques
for Internet of Things [34.544836653715244]
Federated learning (FL) and split learning (SL) are state-of-the-art distributed machine learning techniques.
Recently, FL and SL have been combined to form splitfed learning (SFL) to leverage the benefits of both.
This work considers FL, SL, and SFL, and mounts them on Raspberry Pi devices to evaluate their performance.
arXiv Detail & Related papers (2021-03-03T23:55:37Z)