How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study
- URL: http://arxiv.org/abs/2306.03163v4
- Date: Sun, 2 Jun 2024 09:53:59 GMT
- Title: How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study
- Authors: Alexander Erben, Ruben Mayer, Hans-Arno Jacobsen
- Abstract summary: We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
- Score: 57.97785297481162
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models. To expand the current training options further, we compare the scalability potential for hybrid-cloud scenarios by adding cloud resources to on-premise hardware to improve training throughput. Finally, we show how leveraging spot instance pricing enables a new cost-efficient way to train models with multiple cheap VMs, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
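To make the cost-versus-throughput trade-off the study evaluates concrete, the minimal Python sketch below compares hypothetical configurations by time per epoch, dollars per epoch, and samples processed per dollar. Every configuration name, throughput, and price in it is an illustrative assumption, not a measurement from the paper.

```python
# Illustrative sketch (not from the paper): comparing training cost-efficiency
# as dollars per epoch for hypothetical configurations. All prices and
# throughputs below are made-up placeholders, not measured values.

EPOCH_SAMPLES = 1_281_167  # e.g. ImageNet-1k training set size

configs = {
    # name: (aggregate throughput in samples/s, total $ per hour)
    "4x spot VM, 1 GPU each":   (1_400.0, 4 * 0.60),  # hypothetical spot price
    "1x on-demand 8-GPU node":  (2_600.0, 12.00),     # hypothetical on-demand price
    "on-premise 4-GPU server":  (1_800.0, 2.50),      # hypothetical amortized cost
}

for name, (throughput, dollars_per_hour) in configs.items():
    epoch_seconds = EPOCH_SAMPLES / throughput
    epoch_cost = dollars_per_hour * epoch_seconds / 3600.0
    samples_per_dollar = throughput * 3600.0 / dollars_per_hour
    print(f"{name:26s}  {epoch_seconds / 60:6.1f} min/epoch  "
          f"${epoch_cost:5.2f}/epoch  {samples_per_dollar:,.0f} samples/$")
```

The point of such a metric is that a pool of cheap spot VMs can win on cost per epoch even when a single powerful node wins on raw throughput, which is the effect the abstract's claim about spot pricing refers to.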
Related papers
- Efficient Training of Large Vision Models via Advanced Automated Progressive Learning [96.71646528053651]
We present an advanced automated progressive learning (AutoProg) framework for efficient training of Large Vision Models (LVMs).
We introduce AutoProg-Zero, by enhancing the AutoProg framework with a novel zero-shot unfreezing schedule search.
Experiments show that AutoProg accelerates ViT pre-training by up to 1.85x on ImageNet and accelerates fine-tuning of diffusion models by up to 2.86x, with comparable or even higher performance.
arXiv Detail & Related papers (2024-09-06T16:24:24Z) - Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z) - PredictChain: Empowering Collaboration and Data Accessibility for AI in a Decentralized Blockchain-based Marketplace [1.4364491422470593]
We propose a blockchain-based marketplace called "PredictChain" for predictive machine-learning models.
This marketplace enables users to upload datasets for training predictive machine learning models, request model training on previously uploaded datasets, or submit queries to trained models.
arXiv Detail & Related papers (2023-07-27T19:56:18Z) - Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training [1.047192732651018]
We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel scaling concepts with new insights into SGD noise, our models accurately estimate the time and cost of different cluster configurations with 5% error (an illustrative sketch of this kind of analytical cost model appears after this list).
arXiv Detail & Related papers (2023-03-12T13:42:39Z) - Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes [100.69714600180895]
Offline Q-learning algorithms exhibit strong performance that scales with model capacity.
We train a single policy on 40 games with near-human performance using networks of up to 80 million parameters.
Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal.
arXiv Detail & Related papers (2022-11-28T08:56:42Z) - Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - Distributed Deep Learning Using Volunteer Computing-Like Paradigm [0.09668407688201358]
Training deep learning models with a large number of parameters and/or large datasets can become prohibitive.
Current solutions are built predominantly for cluster computing systems, whose cost and availability can still be an issue.
We design a distributed solution that can run DL training on a VC system by using a data parallel approach.
arXiv Detail & Related papers (2021-03-16T07:32:58Z) - Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most current training schemes, the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e., training the central classifier on unlabeled data using the outputs of the client models (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-06-12T14:49:47Z) - Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z) - HierTrain: Fast Hierarchical Edge AI Learning with Hybrid Parallelism in Mobile-Edge-Cloud Computing [36.40138484917463]
We propose HierTrain, a hierarchical edge AI learning framework, which efficiently deploys the DNN training task over the hierarchical MECC architecture.
We show that HierTrain can achieve up to 6.9x speedup compared to the cloud-based hierarchical training approach.
arXiv Detail & Related papers (2020-03-22T12:40:06Z)
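Following up on the Scavenger entry above, here is a minimal sketch of the kind of analytical time-and-cost model such work builds: a parallel-scaling term for per-step time plus an SGD-noise-style term for how many steps a given global batch size needs. The functional forms and every constant below are illustrative assumptions, not Scavenger's fitted model.

```python
# Minimal sketch (assumptions only): estimating training time and cost for a
# candidate cluster configuration from a simple parallel-scaling model plus a
# gradient-noise-style heuristic for the number of steps to reach a target.

def steps_to_target(global_batch, b_noise=4_000, base_steps=20_000):
    # Larger batches need fewer steps, with diminishing returns once the
    # batch exceeds the gradient-noise scale (heuristic, not a fitted model).
    return base_steps * (1.0 + b_noise / global_batch)

def step_time(num_workers, per_worker_batch,
              compute_s_per_sample=2.0e-4, comm_s_per_worker=0.01):
    # Compute grows with the local batch; communication grows with workers.
    return per_worker_batch * compute_s_per_sample + num_workers * comm_s_per_worker

def estimate(num_workers, per_worker_batch, dollars_per_worker_hour=0.60):
    global_batch = num_workers * per_worker_batch
    seconds = steps_to_target(global_batch) * step_time(num_workers, per_worker_batch)
    cost = seconds / 3600.0 * dollars_per_worker_hour * num_workers
    return seconds / 3600.0, cost

for workers in (1, 2, 4, 8, 16):
    hours, dollars = estimate(workers, per_worker_batch=256)
    print(f"{workers:2d} workers: ~{hours:5.1f} h, ~${dollars:6.2f}")
```

Sweeping the number of workers with such a model exposes the trade-off this line of work optimizes: more workers cut wall-clock time but, past a point, pay for it in communication overhead and total dollars.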
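For the ensemble-distillation entry, the sketch below illustrates the general idea of fusing client models into a central model by distilling their averaged predictions on unlabeled server-side data. It assumes PyTorch and hypothetical `server_model`, `client_models`, and batch variables, and is not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def distill_step(server_model, client_models, x, optimizer, temperature=1.0):
    """One distillation step: fit the server model to the averaged (softened)
    predictions of the client models on a batch of unlabeled inputs x."""
    with torch.no_grad():
        # Ensemble target: mean of the clients' temperature-softened predictions.
        target = torch.stack(
            [F.softmax(m(x) / temperature, dim=-1) for m in client_models]
        ).mean(dim=0)
    optimizer.zero_grad()
    log_q = F.log_softmax(server_model(x) / temperature, dim=-1)
    loss = F.kl_div(log_q, target, reduction="batchmean")  # KL(ensemble || server)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Toy usage: three "client" linear models fused into a fresh server model.
    clients = [torch.nn.Linear(16, 4) for _ in range(3)]
    server = torch.nn.Linear(16, 4)
    opt = torch.optim.Adam(server.parameters(), lr=1e-3)
    for _ in range(100):
        x = torch.randn(32, 16)  # stand-in for a batch of unlabeled server data
        distill_step(server, clients, x, opt)
```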