SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning
Design and Training
- URL: http://arxiv.org/abs/2205.01853v1
- Date: Wed, 4 May 2022 02:11:26 GMT
- Title: SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning
Design and Training
- Authors: Ahsan Ali, Syed Zawad, Paarijaat Aditya, Istemi Ekin Akkus, Ruichuan
Chen, Feng Yan
- Abstract summary: We propose SMLT, an automated, scalable, and adaptive serverless framework to enable efficient and user-centric ML design and training.
SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling for ML tasks during training.
Our experimental evaluation with large, sophisticated modern ML models demonstrates that SMLT outperforms the state-of-the-art VM-based systems and existing serverless ML training frameworks in both training speed (up to 8X) and monetary cost (up to 3X).
- Score: 4.015081523508339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In today's production machine learning (ML) systems, models are continuously
trained, improved, and deployed. ML design and training are becoming a
continuous workflow of various tasks that have dynamic resource demands.
Serverless computing is an emerging cloud paradigm that provides transparent
resource management and scaling for users and has the potential to
revolutionize the routine of ML design and training. However, hosting modern ML
workflows on existing serverless platforms has non-trivial challenges due to
their intrinsic design limitations such as stateless nature, limited
communication support across function instances, and limited function execution
duration. These limitations result in a lack of an overarching view and
adaptation mechanism for training dynamics and an amplification of existing
problems in ML workflows.
To address the above challenges, we propose SMLT, an automated, scalable, and
adaptive serverless framework to enable efficient and user-centric ML design
and training. SMLT employs an automated and adaptive scheduling mechanism to
dynamically optimize the deployment and resource scaling for ML tasks during
training. SMLT further enables user-centric ML workflow execution by supporting
user-specified training deadlines and budget limits. In addition, by providing
an end-to-end design, SMLT solves the intrinsic problems in serverless
platforms such as the communication overhead, limited function execution
duration, need for repeated initialization, and also provides explicit fault
tolerance for ML training. SMLT is open-sourced and compatible with all major
ML frameworks. Our experimental evaluation with large, sophisticated modern ML
models demonstrates that SMLT outperforms the state-of-the-art VM-based systems
and existing serverless ML training frameworks in both training speed (up to
8X) and monetary cost (up to 3X).
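The abstract describes scheduling that adapts resource scaling under user-specified training deadlines and budget limits. As a rough illustration only, the sketch below picks the cheapest worker count that satisfies both constraints. It is not SMLT's actual scheduler: the time model, the cost model, the price constant, and every name (`Config`, `estimate_time_s`, `estimate_cost_usd`, `pick_config`) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    workers: int      # number of concurrent serverless function instances
    memory_gb: float  # memory allocated to each instance


def estimate_time_s(cfg: Config, compute_s: float = 3600.0,
                    comm_s_per_worker: float = 2.0) -> float:
    """Hypothetical time model: compute splits evenly across workers,
    while communication overhead grows linearly with the worker count."""
    return compute_s / cfg.workers + comm_s_per_worker * cfg.workers


def estimate_cost_usd(cfg: Config, price_per_gb_s: float = 0.00002) -> float:
    """Hypothetical pay-per-use cost: duration x workers x memory x unit price.
    The unit price is a placeholder, not a real provider's rate."""
    return estimate_time_s(cfg) * cfg.workers * cfg.memory_gb * price_per_gb_s


def pick_config(candidates: Iterable[Config], deadline_s: float,
                budget_usd: float) -> Optional[Config]:
    """Return the cheapest candidate meeting both the deadline and the budget,
    or None when no candidate is feasible."""
    feasible = [
        c for c in candidates
        if estimate_time_s(c) <= deadline_s and estimate_cost_usd(c) <= budget_usd
    ]
    return min(feasible, key=estimate_cost_usd) if feasible else None


if __name__ == "__main__":
    options = [Config(workers=w, memory_gb=2.0) for w in (1, 2, 4, 8, 16, 32, 64)]
    print(pick_config(options, deadline_s=300.0, budget_usd=0.30))
```

An adaptive scheduler in this spirit would presumably rerun such a selection during training as its time and cost estimates are refreshed from observed iterations, rather than choosing once up front.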
Related papers
- A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling.
A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs.
In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and
Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical
Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Towards Self-Adaptive Machine Learning-Enabled Systems Through QoS-Aware
Model Switching [1.2277343096128712]
We propose the concept of a Machine Learning Model Balancer, focusing on managing uncertainties related to ML models by using multiple models.
AdaMLS is a novel self-adaptation approach that leverages this concept and extends the traditional MAPE-K loop for continuous MLS adaptation.
Preliminary results suggest AdaMLS surpasses naive and single state-of-the-art models in guarantees.
arXiv Detail & Related papers (2023-08-19T09:33:51Z)
- In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- MLCopilot: Unleashing the Power of Large Language Models in Solving
Machine Learning Tasks [31.733088105662876]
We aim to bridge the gap between machine intelligence and human knowledge by introducing a novel framework.
We showcase the possibility of extending the capability of LLMs to comprehend structured inputs and perform thorough reasoning for solving novel ML tasks.
arXiv Detail & Related papers (2023-04-28T17:03:57Z)
- M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task
Learning with Model-Accelerator Co-design [95.41238363769892]
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly.
Current MTL regimes have to activate nearly the entire model even to just execute a single task.
We present a model-accelerator co-design framework to enable efficient on-device MTL.
arXiv Detail & Related papers (2022-10-26T15:40:24Z)
- BPMN4sML: A BPMN Extension for Serverless Machine Learning. Technology
Independent and Interoperable Modeling of Machine Learning Workflows and
their Serverless Deployment Orchestration [0.0]
Machine learning (ML) continues to permeate all layers of academia, industry and society.
Business Process Model and Notation (BPMN) is widely accepted and applied.
BPMN is short of specific support to represent machine learning.
We introduce BPMN4sML (BPMN for serverless machine learning).
arXiv Detail & Related papers (2022-08-02T10:36:00Z)
- Walle: An End-to-End, General-Purpose, and Large-Scale Production System
for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML).
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z)
- A Unified Transferable Model for ML-Enhanced DBMS [53.46830627879208]
We propose a unified model, MTMLF, that uses a multi-task training procedure to capture the transferable knowledge across tasks and a pretrain-finetune procedure to distill the meta knowledge across DBs.
We believe this paradigm is more suitable for cloud DB services and has the potential to revolutionize the way ML is used in the future.
arXiv Detail & Related papers (2021-05-06T03:31:32Z)
- Robust MAML: Prioritization task buffer with adaptive learning process
for model-agnostic meta-learning [15.894925018423665]
Model agnostic meta-learning (MAML) is a popular state-of-the-art meta-learning algorithm.
This paper proposes a more robust MAML based on an adaptive learning scheme and a prioritization task buffer.
Experimental results on meta reinforcement learning environments demonstrate a substantial performance gain.
arXiv Detail & Related papers (2021-03-15T09:34:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.