SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training
- URL: http://arxiv.org/abs/2205.01853v1
- Date: Wed, 4 May 2022 02:11:26 GMT
- Title: SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training
- Authors: Ahsan Ali, Syed Zawad, Paarijaat Aditya, Istemi Ekin Akkus, Ruichuan Chen, Feng Yan
- Abstract summary: We propose SMLT, an automated, scalable, and adaptive serverless framework to enable efficient and user-centric ML design and training.
SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling for ML tasks during training.
Our experimental evaluation with large, sophisticated modern ML models demonstrates that SMLT outperforms state-of-the-art VM-based systems and existing serverless ML training frameworks in both training speed (up to 8X) and monetary cost (up to 3X).
- Score: 4.015081523508339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In today's production machine learning (ML) systems, models are continuously
trained, improved, and deployed. ML design and training are becoming a
continuous workflow of various tasks that have dynamic resource demands.
Serverless computing is an emerging cloud paradigm that provides transparent
resource management and scaling for users and has the potential to
revolutionize the routine of ML design and training. However, hosting modern ML
workflows on existing serverless platforms poses non-trivial challenges due to
intrinsic design limitations such as their stateless nature, limited
communication support across function instances, and limited function execution
duration. These limitations result in a lack of an overarching view and
adaptation mechanism for training dynamics and an amplification of existing
problems in ML workflows.
To address the above challenges, we propose SMLT, an automated, scalable, and
adaptive serverless framework to enable efficient and user-centric ML design
and training. SMLT employs an automated and adaptive scheduling mechanism to
dynamically optimize the deployment and resource scaling for ML tasks during
training. SMLT further enables user-centric ML workflow execution by supporting
user-specified training deadlines and budget limits. In addition, by providing
an end-to-end design, SMLT solves intrinsic problems of serverless platforms
such as communication overhead, limited function execution duration, and the
need for repeated initialization, and it also provides explicit fault
tolerance for ML training. SMLT is open-sourced and compatible with all major
ML frameworks. Our experimental evaluation with large, sophisticated modern ML
models demonstrates that SMLT outperforms state-of-the-art VM-based systems
and existing serverless ML training frameworks in both training speed (up to
8X) and monetary cost (up to 3X).
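
The abstract names user-specified deadlines and budget limits as inputs to SMLT's adaptive scheduler but does not show the user-facing interface. Below is a minimal, hypothetical Python sketch of what deadline- and budget-aware job submission could look like; every name (`TrainingJob`, `plan_workers`) and the linear-speedup cost model are illustrative assumptions, not SMLT's actual API.

```python
# Hypothetical sketch of deadline/budget-aware serverless training submission
# in the spirit of SMLT. Names and cost model are assumptions, not SMLT's API.
from dataclasses import dataclass

@dataclass
class TrainingJob:
    model: str
    dataset: str
    deadline_s: float   # user-specified wall-clock deadline (seconds)
    budget_usd: float   # user-specified monetary budget (USD)

def plan_workers(job: TrainingJob,
                 est_compute_s: float = 3600.0,    # assumed serial training time
                 usd_per_worker_s: float = 1e-4,   # assumed per-worker-second price
                 max_workers: int = 256) -> int:
    """Pick the smallest worker count meeting the deadline within budget,
    assuming (unrealistically) linear speedup."""
    for n in range(1, max_workers + 1):
        runtime = est_compute_s / n
        cost = runtime * n * usd_per_worker_s
        if runtime <= job.deadline_s and cost <= job.budget_usd:
            return n
    raise ValueError("no worker count satisfies both deadline and budget")

job = TrainingJob("resnet50", "s3://bucket/data", deadline_s=600, budget_usd=1.0)
print(plan_workers(job))  # -> 6 under the assumed cost model
```

A real scheduler would re-plan as observed throughput and cost drift from these estimates during training, which is the role the abstract assigns to SMLT's adaptive scheduling mechanism.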
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the LLM's computational cost by 5.2-6.5x and its GPU memory footprint by 2-6x without compromising performance.
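The summary does not state DeeR's exit criterion; the sketch below shows only the generic early-exit pattern such frameworks build on, where inference stops at the first intermediate head that is confident enough. The toy layers, heads, and threshold are stand-ins, not DeeR's architecture.

```python
# Generic early-exit inference sketch (illustrative; not DeeR's criterion):
# stop at the first intermediate head whose max softmax probability clears
# a threshold, so easy inputs use fewer layers.
import numpy as np

rng = np.random.default_rng(0)
LAYERS = [0.1 * rng.standard_normal((8, 8)) for _ in range(6)]  # toy backbone
HEADS = [rng.standard_normal((8, 4)) for _ in range(6)]         # per-layer exit heads

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, threshold=0.9):
    """Return (predicted class, layers used)."""
    h = x
    for i, (layer, head) in enumerate(zip(LAYERS, HEADS), start=1):
        h = np.tanh(h @ layer)
        probs = softmax(h @ head)
        if probs.max() >= threshold or i == len(LAYERS):
            return int(probs.argmax()), i

pred, used = early_exit_forward(rng.standard_normal(8), threshold=0.5)
print(f"class {pred} after {used}/{len(LAYERS)} layers")
```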
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Resource Allocation for Stable LLM Training in Mobile Edge Computing [11.366306689957353]
This paper explores a collaborative training framework that integrates mobile users with edge servers to optimize resource allocation.
We formulate a multi-objective optimization problem to minimize the total energy consumption and delay during training.
We also address the common issue of instability in model performance by incorporating stability enhancements into our objective function.
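The paper's actual objective and constraints are not reproduced in the summary; purely as an illustration, a weighted-sum scalarization of such an energy/delay trade-off could be written as:

```latex
% Illustrative weighted-sum scalarization (not the paper's actual objective):
% E(a) = total energy, T(a) = training delay, a = allocation decision,
% \alpha \in [0, 1] trades the two objectives off.
\min_{a \in \mathcal{A}} \; \alpha \, E(a) + (1 - \alpha) \, T(a)
```

The stability enhancement the summary mentions would then enter as an extra penalty term, e.g. on the variance of model performance across training rounds.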
arXiv Detail & Related papers (2024-09-30T12:36:27Z)
- MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
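As a rough illustration of extending a mixture-of-experts without touching pretrained weights (the shapes, initialization, and routing here are assumptions, not MoExtend's mechanism):

```python
# Toy MoE extension sketch: add one new expert and one new gate column while
# the pretrained experts and gate stay frozen. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
D, K = 8, 3
experts = [rng.standard_normal((D, D)) for _ in range(K)]  # pretrained, frozen
gate = rng.standard_normal((D, K))                         # pretrained, frozen

def moe_forward(x, experts, gate):
    logits = x @ gate
    w = np.exp(logits - logits.max())
    w /= w.sum()                                   # softmax routing weights
    return sum(wi * np.tanh(x @ e) for wi, e in zip(w, experts))

def extend(experts, gate):
    """Append one trainable expert and gate column; existing parameters
    are left untouched (only the new pieces would be tuned)."""
    new_expert = 0.01 * rng.standard_normal((D, D))   # near-zero init
    new_col = np.zeros((D, 1))                        # learned during tuning
    return experts + [new_expert], np.hstack([gate, new_col])

x = rng.standard_normal(D)
experts2, gate2 = extend(experts, gate)
print(moe_forward(x, experts, gate).shape, moe_forward(x, experts2, gate2).shape)
```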
arXiv Detail & Related papers (2024-08-07T02:28:37Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances using generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Towards Self-Adaptive Machine Learning-Enabled Systems Through QoS-Aware Model Switching [1.2277343096128712]
We propose the concept of a Machine Learning Model Balancer, focusing on managing uncertainties related to ML models by using multiple models.
AdaMLS is a novel self-adaptation approach that leverages this concept and extends the traditional MAPE-K loop for continuous MLS adaptation.
Preliminary results suggest that AdaMLS surpasses both naive approaches and single state-of-the-art models in the guarantees it provides.
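As a rough sketch of the MAPE-K-style control loop the summary alludes to (the models, metrics, and switching rule below are stand-ins, not AdaMLS's algorithm):

```python
# MAPE-K-style QoS-aware model switching sketch (illustrative, not AdaMLS):
# Monitor a latency signal, Analyze it against a goal, Plan and Execute a
# model swap; thresholds and models are stand-ins.
ACCURACY = {"small": 0.85, "large": 0.93}  # assumed per-model accuracy

def mape_k_step(current: str, observed_latency: float, latency_goal: float) -> str:
    if observed_latency > latency_goal and current == "large":
        return "small"             # goal violated: trade accuracy for latency
    if observed_latency < 0.5 * latency_goal and current == "small":
        return "large"             # ample slack: restore accuracy
    return current                 # otherwise keep the current model

model = "large"
for latency in (0.15, 0.30, 0.25, 0.05):   # simulated observations (seconds)
    model = mape_k_step(model, latency, latency_goal=0.25)
    print(f"latency={latency:.2f}s -> {model} (acc~{ACCURACY[model]})")
```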
arXiv Detail & Related papers (2023-08-19T09:33:51Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
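As a minimal illustration of in situ coupling (not this paper's framework), the solver and the learner can share one loop, so fresh simulation output feeds training directly with no intermediate files:

```python
# In situ coupling sketch (illustrative): each "solver" step yields training
# data that immediately updates an online surrogate model, avoiding the I/O
# and storage bottlenecks of writing datasets to disk first.
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([1.5, -2.0, 0.5])   # hidden relation the surrogate learns
w = np.zeros(3)                       # online surrogate parameters

def simulation_step():
    """Stand-in for one CFD solver step producing a fresh (input, output) pair."""
    x = rng.standard_normal(3)
    return x, true_w @ x + 0.01 * rng.standard_normal()

for _ in range(2000):
    x, y = simulation_step()          # simulate...
    w += 0.01 * (y - w @ x) * x       # ...and train in the same loop (one SGD step)

print(np.round(w, 2))                 # ~[ 1.5 -2.   0.5]
```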
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- BPMN4sML: A BPMN Extension for Serverless Machine Learning. Technology Independent and Interoperable Modeling of Machine Learning Workflows and their Serverless Deployment Orchestration [0.0]
Machine learning (ML) continues to permeate all layers of academia, industry and society.
Business Process Model and Notation (BPMN) is widely accepted and applied.
BPMN, however, lacks specific support for representing machine learning.
We introduce BPMN4sML (BPMN for serverless machine learning).
arXiv Detail & Related papers (2022-08-02T10:36:00Z)
- Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML).
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z)
- A Unified Transferable Model for ML-Enhanced DBMS [53.46830627879208]
We propose MTMLF, a unified model that uses a multi-task training procedure to capture the transferable knowledge across tasks and a pretrain-finetune procedure to distill the meta knowledge across DBs.
We believe this paradigm is more suitable for cloud DB services and has the potential to revolutionize the way ML is used in the future.
arXiv Detail & Related papers (2021-05-06T03:31:32Z)
- Robust MAML: Prioritization task buffer with adaptive learning process for model-agnostic meta-learning [15.894925018423665]
Model agnostic meta-learning (MAML) is a popular state-of-the-art meta-learning algorithm.
This paper proposes a more robust MAML based on an adaptive learning scheme and a prioritization task buffer.
Experimental results on meta reinforcement learning environments demonstrate a substantial performance gain.
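The paper's exact prioritization rule is not given in the summary; a generic prioritized task buffer for meta-learning might simply sample tasks in proportion to their current loss, as sketched below (all names are illustrative):

```python
# Generic prioritized task buffer sketch (illustrative, not the paper's scheme):
# sample meta-training tasks with probability proportional to current loss,
# so tasks the learner handles poorly are revisited more often.
import random

class PriorityTaskBuffer:
    def __init__(self):
        self.tasks, self.losses = [], []

    def add(self, task, loss: float):
        self.tasks.append(task)
        self.losses.append(max(loss, 1e-8))   # keep sampling weights positive

    def sample(self):
        return random.choices(self.tasks, weights=self.losses, k=1)[0]

    def update(self, task, new_loss: float):
        self.losses[self.tasks.index(task)] = max(new_loss, 1e-8)

buf = PriorityTaskBuffer()
for task, loss in [("goal-A", 2.0), ("goal-B", 0.2), ("goal-C", 1.0)]:
    buf.add(task, loss)
print(buf.sample())   # "goal-A" comes up most often
```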
arXiv Detail & Related papers (2021-03-15T09:34:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.