DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging
- URL: http://arxiv.org/abs/2006.00441v1
- Date: Sun, 31 May 2020 05:43:50 GMT
- Title: DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging
- Authors: Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang,
Runsheng Wang and Ru Huang
- Abstract summary: The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation to wait for gradient aggregation.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
- Score: 4.652668321425679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The state-of-the-art deep learning algorithms rely on distributed training
systems to tackle the increasing sizes of models and training data sets.
The minibatch stochastic gradient descent (SGD) algorithm requires workers to
halt forward/back propagation, to wait for gradients aggregated from all
workers, and to receive weight updates before the next batch of tasks. This synchronous
execution model exposes the overheads of gradient/weight communication among a
large number of workers in a distributed training system. We propose a new SGD
algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and
forward/back propagations to hide 100% of the communication overhead. By
adjusting the gradient update scheme, the algorithm uses hardware resources
more efficiently and reduces reliance on low-latency, high-throughput
interconnects. Theoretical analysis and experimental results show a
convergence rate of O(1/sqrt(K)), the same as SGD. The performance evaluation
demonstrates a linear performance scale-up with the cluster size.
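To make the delayed-averaging idea concrete, the following is a minimal single-process Python sketch (not the authors' implementation): each simulated worker keeps taking local SGD steps, while the global average computed at step t only becomes visible a fixed number of steps later, modeling communication that overlaps with forward/back propagation. The numpy simulation, the toy quadratic objective, and the choice to simply replace local models with the stale average are illustrative assumptions; the paper's actual gradient update scheme differs in its details.

```python
# Toy single-process simulation of local SGD with delayed averaging
# (illustrative only; names and the quadratic objective are assumptions).
import numpy as np

def delayed_averaging_sgd(num_workers=4, steps=200, lr=0.1, delay=2,
                          dim=10, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)                     # toy optimum
    w = [np.zeros(dim) for _ in range(num_workers)]   # per-worker weights
    pending = []                                      # (apply_at_step, average)

    for t in range(steps):
        # Each worker takes a local SGD step on its own noisy gradient;
        # it never blocks to wait for communication.
        for k in range(num_workers):
            grad = (w[k] - target) + 0.1 * rng.normal(size=dim)
            w[k] -= lr * grad

        # Launch an average "in the background": its result only becomes
        # visible `delay` steps later, hiding the communication latency.
        pending.append((t + delay, np.mean(w, axis=0)))

        # Apply any averages whose delay has elapsed (stale by `delay` steps).
        while pending and pending[0][0] <= t:
            _, avg = pending.pop(0)
            w = [avg.copy() for _ in range(num_workers)]

    return np.mean(w, axis=0), target

w_avg, target = delayed_averaging_sgd()
print("distance to optimum:", np.linalg.norm(w_avg - target))
```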
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters [9.885668723959125]
We propose a delayed synchronous distributed gradient descent algorithm with adaptive batch size (ABS-SGD) for heterogeneous GPU clusters.
In ABS-SGD, workers perform global synchronization to accumulate delayed gradients and use the accumulated delayed gradients to update parameters.
Extensive experiments in three types of heterogeneous clusters demonstrate that ABS-SGD can make full use of computational resources.
arXiv Detail & Related papers (2023-08-29T09:46:52Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic gradient coding (GC) scheme that assigns redundant data to workers, providing the flexibility to choose among a set of possible codes depending on past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring [18.8426865970643]
A novel Hierarchical Parallel SGD (HPSGD) strategy is proposed to boost the distributed training process of deep neural networks (DNNs).
Experiments are conducted to demonstrate that the proposed HPSGD approach substantially boosts distributed DNN training, reduces the disturbance of stale gradients, and achieves better accuracy in a given fixed wall-time.
arXiv Detail & Related papers (2020-09-06T10:17:56Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems with a deep neural network.
Our method requires much fewer communication rounds in theory.
Experiments on several datasets demonstrate the effectiveness of our method and confirm the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [34.55741812648229]
We present WAGMA-SGD, a wait-avoiding group-averaging SGD variant that reduces global communication via subgroup weight exchange.
We train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
- Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD [32.03967072200476]
We propose an algorithmic approach named Overlap-Local-SGD (and its momentum variant); a toy sketch of the idea follows this entry.
We achieve this by adding an anchor model on each node.
After multiple local updates, locally trained models will be pulled back towards the anchor model rather than communicating with others.
arXiv Detail & Related papers (2020-02-21T20:33:49Z)
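As noted in the Overlap Local-SGD entry above, here is a toy single-process Python sketch of the anchor-model idea: local models are pulled toward a stale anchor instead of blocking on fresh communication, and the next average is computed in the background and only becomes the anchor one round later. The pull coefficient, the numpy setup, and the quadratic toy loss are assumptions for illustration, not the paper's implementation.

```python
# Toy single-process sketch of an anchor-model scheme in the spirit of
# Overlap-Local-SGD (illustrative assumptions throughout).
import numpy as np

def overlap_local_sgd(num_workers=4, rounds=50, local_steps=5,
                      lr=0.1, pull=0.5, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)                      # toy optimum
    w = [np.zeros(dim) for _ in range(num_workers)]    # local models
    anchor = np.zeros(dim)                             # stale global average
    in_flight = anchor.copy()                          # average "being communicated"

    for _ in range(rounds):
        for k in range(num_workers):
            for _ in range(local_steps):               # local computation
                grad = (w[k] - target) + 0.1 * rng.normal(size=dim)
                w[k] -= lr * grad
            # Pull the local model toward the (stale) anchor; no blocking sync.
            w[k] += pull * (anchor - w[k])
        # The average launched last round "arrives" now and becomes the anchor;
        # a new average starts communicating in the background.
        anchor, in_flight = in_flight, np.mean(w, axis=0)

    return np.mean(w, axis=0), target

w_avg, target = overlap_local_sgd()
print("distance to optimum:", np.linalg.norm(w_avg - target))
```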
This list is automatically generated from the titles and abstracts of the papers in this site.