Scaling Distributed Training with Adaptive Summation
- URL: http://arxiv.org/abs/2006.02924v1
- Date: Thu, 4 Jun 2020 15:08:20 GMT
- Title: Scaling Distributed Training with Adaptive Summation
- Authors: Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju
Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum
- Abstract summary: This paper introduces a novel method to combine gradients called Adasum (for adaptive sum) that converges faster than prior work.
Adasum is easy to implement, almost as efficient as simply summing gradients, and is integrated into the open-source toolkit Horovod.
- Score: 2.6210166639679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) is an inherently sequential training
algorithm--computing the gradient at batch $i$ depends on the model parameters
learned from batch $i-1$. Prior approaches that break this dependence do not
honor it (e.g., they sum the gradients for each batch, which is not what
sequential SGD would do) and thus potentially suffer from poor convergence.
This paper introduces a novel method to combine gradients called Adasum (for
adaptive sum) that converges faster than prior work. Adasum is easy to
implement, almost as efficient as simply summing gradients, and is integrated
into the open-source toolkit Horovod.
This paper first provides a formal justification for Adasum and then
empirically demonstrates that Adasum is more accurate than prior
gradient-accumulation methods. It then presents a series of case studies showing
that Adasum works with multiple frameworks (TensorFlow and PyTorch) and scales
multiple optimizers (Momentum-SGD, Adam, and LAMB) to larger batch sizes while
still giving good downstream accuracy. Finally, it proves that Adasum
converges.
To summarize, Adasum scales Momentum-SGD on the MLPerf Resnet50 benchmark to
64K examples before communication (no MLPerf v0.5 entry converged with more
than 16K), the Adam optimizer to 64K examples before communication on
BERT-LARGE (prior work showed Adam stopped scaling at 16K), and the LAMB
optimizer to 128K before communication on BERT-LARGE (prior work used 64K), all
while maintaining downstream accuracy metrics. Finally, if a user does not need
to scale, we show LAMB with Adasum on BERT-LARGE converges in 30% fewer steps
than the baseline.
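As a rough illustration of how the adaptive sum behaves, the two-gradient rule described in the paper scales each gradient's contribution down as the two gradients become more aligned: orthogonal gradients are simply summed, identical gradients are effectively averaged, and more workers are handled by combining pairs in a reduction tree. The NumPy sketch below reflects our reading of that rule; it is illustrative only and is not Horovod's implementation.

```python
import numpy as np

def adasum_pair(g1, g2, eps=1e-12):
    """Adaptive sum of two gradients (sketch of the rule as we read it):
    each gradient's weight shrinks as the two gradients become more aligned,
    so orthogonal gradients add up and identical gradients average out."""
    dot = np.dot(g1, g2)
    return ((1.0 - dot / (2.0 * np.dot(g1, g1) + eps)) * g1 +
            (1.0 - dot / (2.0 * np.dot(g2, g2) + eps)) * g2)

def adasum(grads):
    """Reduce a list of per-worker gradients by pairwise (tree-style) combination."""
    grads = list(grads)
    while len(grads) > 1:
        grads = [adasum_pair(grads[i], grads[i + 1]) if i + 1 < len(grads)
                 else grads[i]
                 for i in range(0, len(grads), 2)]
    return grads[0]

# Limiting cases: orthogonal gradients behave like a sum,
# identical gradients behave like an average.
g, h = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(adasum_pair(g, h))  # ~[1. 1.]
print(adasum_pair(g, g))  # ~[1. 0.]
```

Because Adasum is integrated into Horovod, enabling it should amount to selecting the Adasum reduction op when wrapping the optimizer. The snippet below is a minimal sketch assuming Horovod's PyTorch binding; consult the Horovod documentation for the exact API of your version.

```python
import torch
import horovod.torch as hvd

hvd.init()
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# op=hvd.Adasum requests the adaptive-sum reduction instead of the
# default gradient averaging (sketch; verify against your Horovod version).
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     op=hvd.Adasum)
```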
Related papers
- AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size [42.84471753630676]
We present a novel adaptation of Stochastic Gradient Descent (SGD) called AdaBatchGrad.
It seamlessly integrates an adaptive step size with an adjustable batch size.
We experimentally show how the introduction of an adaptive step size and an adaptive batch size gradually improves the performance of regular SGD.
arXiv Detail & Related papers (2024-02-07T21:19:05Z) - Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, L-smooth objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [134.83964935755964]
In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials.
To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm (Adan).
arXiv Detail & Related papers (2022-08-13T16:04:39Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study asynchronous SGD for distributed workers with varying computation and communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{\mathrm{avg}}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale-invariant version of BERT, called SIBERT, which, when trained simply by vanilla SGD, achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise in the stochastic gradients and (ii) problem-dependent constants.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - Exploiting Adam-like Optimization Algorithms to Improve the Performance
of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that use the difference between the present and the past gradients.
We have tested ensembles of networks and their fusion with ResNet50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, but only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning-rate principle, in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence covering both the Adam and Adagrad algorithms, showing a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges with the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.