Decentralized Training of Foundation Models in Heterogeneous
Environments
- URL: http://arxiv.org/abs/2206.01288v4
- Date: Wed, 21 Jun 2023 13:59:20 GMT
- Title: Decentralized Training of Foundation Models in Heterogeneous
Environments
- Authors: Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao,
Beidi Chen, Percy Liang, Christopher Re, Ce Zhang
- Abstract summary: Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
- Score: 77.47261769795992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training foundation models, such as GPT-3 and PaLM, can be extremely
expensive, often involving tens of thousands of GPUs running continuously for
months. These models are typically trained in specialized clusters featuring
fast, homogeneous interconnects and using carefully designed software systems
that support both data parallelism and model/pipeline parallelism. Such
dedicated clusters can be costly and difficult to obtain. Can we instead
leverage the much greater amount of decentralized, heterogeneous, and
lower-bandwidth interconnected compute? Previous works examining the
heterogeneous, decentralized setting focus on relatively small models that can
be trained in a purely data parallel manner. State-of-the-art schemes for model
parallel foundation model training, such as Megatron, only consider the
homogeneous data center setting. In this paper, we present the first study of
training large foundation models with model parallelism in a decentralized
regime over a heterogeneous network. Our key technical contribution is a
scheduling algorithm that allocates different computational "tasklets" in the
training of foundation models to a group of decentralized GPU devices connected
by a slow heterogeneous network. We provide a formal cost model and further
propose an efficient evolutionary algorithm to find the optimal allocation
strategy. We conduct extensive experiments that represent different scenarios
for learning over geo-distributed devices simulated using real-world network
measurements. In the most extreme case, across 8 different cities spanning 3
continents, our approach is 4.8X faster than prior state-of-the-art training
systems (Megatron).
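The core technical idea, searching over assignments of computational "tasklets" to heterogeneously connected devices under a communication cost model, can be illustrated with a toy evolutionary search. The sketch below is a deliberately simplified stand-in; its cost model, parameters, and one-tasklet-per-device restriction are assumptions made for illustration, not the paper's actual formulation.
```python
"""Toy sketch: evolutionary search for a tasklet-to-device allocation.
All names, the cost model, and the numbers are hypothetical simplifications."""
import random

NUM_DEVICES = 8     # e.g., GPUs spread across several cities
NUM_TASKLETS = 8    # pipeline stages / model partitions to place
ACTIVATION_GB = 2.0 # assumed traffic between adjacent tasklets per iteration

random.seed(0)
# Symmetric random bandwidth matrix (GB/s) standing in for real measurements.
bandwidth = [[0.0] * NUM_DEVICES for _ in range(NUM_DEVICES)]
for i in range(NUM_DEVICES):
    for j in range(i + 1, NUM_DEVICES):
        bandwidth[i][j] = bandwidth[j][i] = random.uniform(0.05, 10.0)

def comm_cost(assignment):
    """Sum of transfer times between consecutive tasklets (toy cost model)."""
    cost = 0.0
    for t in range(NUM_TASKLETS - 1):
        a, b = assignment[t], assignment[t + 1]
        if a != b:
            cost += ACTIVATION_GB / bandwidth[a][b]
    return cost

def evolve(pop_size=64, generations=200, mutation_rate=0.2):
    """Keep the cheapest assignments each generation, mutate by swapping."""
    population = [random.sample(range(NUM_DEVICES), NUM_TASKLETS)
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=comm_cost)
        survivors = population[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            if random.random() < mutation_rate:
                i, j = random.sample(range(NUM_TASKLETS), 2)
                child[i], child[j] = child[j], child[i]  # swap two placements
            children.append(child)
        population = survivors + children
    return min(population, key=comm_cost)

best = evolve()
print("best assignment:", best, "cost (s/iter):", round(comm_cost(best), 3))
```
A realistic cost model would also have to account for compute heterogeneity and data-parallel gradient synchronization; the evolutionary loop itself would stay the same and only the scoring function would change.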
Related papers
- From promise to practice: realizing high-performance decentralized training [8.955918346078935]
Decentralized training of deep neural networks has attracted significant attention for its theoretically superior scalability over synchronous data-parallel methods like All-Reduce.
This paper identifies three key factors that can lead to speedups over All-Reduce training and constructs a runtime model to determine when, how, and to what degree decentralization can yield shorter per-iteration runtimes (a back-of-the-envelope sketch of this kind of model appears after this list).
arXiv Detail & Related papers (2024-10-15T19:04:56Z) - ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of massive models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can improve training efficiency by up to 20X compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Supernet Training for Federated Image Classification under System
Heterogeneity [15.2292571922932]
In this work, we propose a novel framework to consider both scenarios, namely Federation of Supernet Training (FedSup)
It is inspired by how averaging parameters in the model aggregation stage of Federated Learning (FL) is similar to weight-sharing in supernet training.
Under our framework, we present an efficient algorithm (E-FedSup) by sending the sub-model to clients in the broadcast stage for reducing communication costs and training overhead.
arXiv Detail & Related papers (2022-06-03T02:21:01Z) - Parallel Successive Learning for Dynamic Distributed Model Training over
Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z) - Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel
Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z) - Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that outperforms the low-rank model trained alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z) - Clustered Federated Learning via Generalized Total Variation
Minimization [83.26141667853057]
We study optimization methods to train local (or personalized) models for local datasets with a decentralized network structure.
Our main conceptual contribution is to formulate federated learning as generalized total variation (GTV) minimization.
Our main algorithmic contribution is a fully decentralized federated learning algorithm.
arXiv Detail & Related papers (2021-05-26T18:07:19Z)
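The runtime-model idea in the "From promise to practice" entry above can be sketched as a back-of-the-envelope comparison of per-iteration time under All-Reduce versus a gossip-style decentralized exchange. The formulas, bandwidth, model size, and neighbor count below are illustrative assumptions, not that paper's actual model.
```python
# Hedged per-iteration runtime comparison: ring All-Reduce vs. gossip exchange.

def allreduce_time(param_gb, n_workers, bandwidth_gbps):
    # Ring All-Reduce moves roughly 2 * (n-1)/n of the model per worker.
    return 2 * (n_workers - 1) / n_workers * param_gb / bandwidth_gbps

def gossip_time(param_gb, neighbors, bandwidth_gbps):
    # Each worker exchanges its parameters with only a few neighbors.
    return neighbors * param_gb / bandwidth_gbps

compute_s = 0.8                        # assumed forward/backward time per step
params_gb, workers, bw = 1.3, 32, 1.0  # assumed model size, worker count, GB/s
print("All-Reduce step:", compute_s + allreduce_time(params_gb, workers, bw))
print("Gossip step    :", compute_s + gossip_time(params_gb, 2, bw))
```
Under these assumptions the gossip step is cheaper whenever the per-neighbor traffic is smaller than the All-Reduce volume; whether that holds in practice depends on the actual topology and bandwidths, which is exactly what a runtime model of this kind is meant to decide.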