Parallel Training of Deep Networks with Local Updates
- URL: http://arxiv.org/abs/2012.03837v1
- Date: Mon, 7 Dec 2020 16:38:45 GMT
- Title: Parallel Training of Deep Networks with Local Updates
- Authors: Michael Laskin, Luke Metz, Seth Nabarrao, Mark Saroufim, Badreddine
Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel
- Abstract summary: Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
- Score: 84.30918922367442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning models trained on large data sets have been widely successful
in both vision and language domains. As state-of-the-art deep learning
architectures have continued to grow in parameter count so have the compute
budgets and times required to train them, increasing the need for
compute-efficient methods that parallelize training. Two common approaches to
parallelize the training of deep networks have been data and model parallelism.
While useful, data and model parallelism suffer from diminishing returns in
terms of compute efficiency for large batch sizes. In this paper, we
investigate how to continue scaling compute efficiently beyond the point of
diminishing returns for large batches through local parallelism, a framework
which parallelizes training of individual layers in deep networks by replacing
global backpropagation with truncated layer-wise backpropagation. Local
parallelism enables fully asynchronous layer-wise parallelism with a low memory
footprint, and requires little communication overhead compared with model
parallelism. We show results in both vision and language domains across a
diverse set of architectures, and find that local parallelism is particularly
effective in the high-compute regime.
Related papers
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL)
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z) - SplitBrain: Hybrid Data and Model Parallel Deep Learning [11.63431725146897]
This paper presents SplitBrain, a high performance distributed deep learning framework supporting hybrid data and model parallelism.
Specifically, SplitBrain provides layer-specific partitioning that co-locates compute intensive convolutional layers while sharding memory demanding layers.
Results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data and model parallel VGG over CIFAR-10.
arXiv Detail & Related papers (2021-12-31T06:25:38Z) - Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel
Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
System supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z) - TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale
Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z) - Towards a Scalable and Distributed Infrastructure for Deep Learning
Applications [4.4979162962108905]
Phylanx offers a productivity-oriented execution tree that can be executed on multiple nodes.
We present Phylanx that has the potential to alleviate shortcomings in distributed deep learning frameworks.
arXiv Detail & Related papers (2020-10-06T20:38:47Z) - Restructuring, Pruning, and Adjustment of Deep Models for Parallel
Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers)
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z) - A Linear Algebraic Approach to Model Parallelism in Deep Learning [0.0]
Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity.
We propose a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN.
We build distributed DNN layers using these parallel primitives, composed with sequential layer implementations, and demonstrate their application by building and training a distributed DNN using DistDL, a PyTorch and MPI-based distributed deep learning toolkit.
arXiv Detail & Related papers (2020-06-04T19:38:05Z) - Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.