Adaptive Braking for Mitigating Gradient Delay
- URL: http://arxiv.org/abs/2007.01397v2
- Date: Fri, 10 Jul 2020 17:12:25 GMT
- Title: Adaptive Braking for Mitigating Gradient Delay
- Authors: Abhinav Venigalla and Atli Kosson and Vitaliy Chiley and Urs Köster
- Abstract summary: We introduce Adaptive Braking (AB), a modification for momentum-based optimizers that mitigates the effects of gradient delay.
We show that applying AB on top of SGD with momentum enables training ResNets on CIFAR-10 and ImageNet-1k under gradient delay with minimal drop in final test accuracy.
- Score: 0.8602553195689513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network training is commonly accelerated by using multiple
synchronized workers to compute gradient updates in parallel. Asynchronous
methods remove synchronization overheads and improve hardware utilization at
the cost of introducing gradient delay, which impedes optimization and can lead
to lower final model performance. We introduce Adaptive Braking (AB), a
modification for momentum-based optimizers that mitigates the effects of
gradient delay. AB dynamically scales the gradient based on the alignment of
the gradient and the velocity. This can dampen oscillations along high
curvature directions of the loss surface, stabilizing and accelerating
asynchronous training. We show that applying AB on top of SGD with momentum
enables training ResNets on CIFAR-10 and ImageNet-1k with delays of $D \geq 32$
update steps with minimal drop in final test accuracy.
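The abstract specifies only that AB rescales the gradient according to its alignment with the velocity. The sketch below shows one way such a rule can be wrapped around SGD with momentum, using the cosine of the angle between the gradient and the velocity; the specific 1 + cos scaling, the hyperparameters, and the toy delay loop are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def ab_sgd_step(params, grad, velocity, lr=0.1, momentum=0.9, eps=1e-12):
    """One SGD-with-momentum step with an Adaptive-Braking-style gradient scale.

    The (possibly delayed) gradient is rescaled by a factor that depends on its
    alignment with the velocity; the rule below, 1 + cos(grad, velocity), is an
    illustrative assumption rather than the paper's exact formula.
    """
    # Alignment between the incoming gradient and the current velocity.
    cos = grad @ velocity / (np.linalg.norm(grad) * np.linalg.norm(velocity) + eps)
    scale = 1.0 + cos  # in [0, 2]: damp gradients that accelerate the parameters
                       # along the velocity, strengthen those that brake it

    velocity = momentum * velocity - lr * scale * grad
    params = params + velocity
    return params, velocity

# Toy usage: minimize ||w||^2 with gradients delayed by D update steps.
D = 4
w, v = np.array([5.0, -3.0]), np.zeros(2)
snapshots = [w.copy()] * (D + 1)      # stale parameter copies seen by the "worker"
for step in range(200):
    stale_w = snapshots.pop(0)        # gradient is computed on D-step-old weights
    grad = 2.0 * stale_w
    w, v = ab_sgd_step(w, grad, v, lr=0.05)
    snapshots.append(w.copy())
```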
Related papers
- Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch.
FNGD resembles the average-sum computation of first-order methods, so its computational complexity is comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z)
- MetaGrad: Adaptive Gradient Quantization with Hypernetworks [46.55625589293897]
Quantization-aware Training (QAT) accelerates the forward pass during neural network training and inference.
In this work, we propose to solve this problem by incorporating the gradients into the computation graph of the next training iteration via a hypernetwork.
Experiments on the CIFAR-10 dataset with different CNN architectures demonstrate that our hypernetwork-based approach can effectively reduce the negative effect of gradient quantization noise.
arXiv Detail & Related papers (2023-03-04T07:26:34Z)
- Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [158.19276683455254]
Adaptive gradient algorithms borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence.
Nesterov acceleration converges faster than heavy-ball acceleration in theory and also in many empirical cases.
In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point.
We show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks.
arXiv Detail & Related papers (2022-08-13T16:04:39Z)
- GBA: A Tuning-free Approach to Switch between Synchronous and Asynchronous Training for Recommendation Model [19.65557684234458]
We propose Global Batch gradients Aggregation (GBA) over a parameter server (PS).
A token-control process is implemented to assemble the gradients and decay gradients with severe staleness.
Experiments on three industrial-scale recommendation tasks show that GBA is an effective tuning-free approach for switching between synchronous and asynchronous training.
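As a rough illustration of the staleness-decay idea (not GBA's actual token-control protocol), the sketch below down-weights each worker's gradient by an exponential factor in its staleness before combining them into a global-batch gradient; the names, data structure, and decay rule are assumptions.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class WorkerGrad:
    grad: np.ndarray   # gradient reported by one worker
    staleness: int     # global updates applied since the worker read its weights

def aggregate_global_batch(reports: List[WorkerGrad], decay: float = 0.8) -> np.ndarray:
    # Down-weight gradients with severe staleness (illustrative exponential decay),
    # then combine them into a single global-batch gradient.
    weights = np.array([decay ** r.staleness for r in reports])
    stacked = np.stack([r.grad for r in reports])
    return (weights[:, None] * stacked).sum(axis=0) / weights.sum()

# Example: three workers, one of them badly stale.
reports = [WorkerGrad(np.array([1.0, 0.0]), 0),
           WorkerGrad(np.array([0.9, 0.1]), 1),
           WorkerGrad(np.array([-2.0, 3.0]), 8)]
g = aggregate_global_batch(reports)
```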
arXiv Detail & Related papers (2022-05-23T05:22:42Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
A deep equilibrium (DEQ) model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between solving for this fixed point and optimizing over the network's inputs.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training, and gradient-based meta-learning.
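For the fixed-point view mentioned above, here is a minimal sketch of a deep-equilibrium-style forward pass that solves z = tanh(Wz + Ux + b) by naive fixed-point iteration; real DEQ models use more capable root-finding solvers, and the contraction scaling of W here is an assumption needed for this simple iteration to converge.

```python
import numpy as np

def deq_forward(x, W, U, b, tol=1e-6, max_iter=500):
    """Compute the layer's output as the fixed point z* = tanh(W z* + U x + b)
    via plain fixed-point iteration (illustrative; not the paper's solver)."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(0)
d, n = 8, 4
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)   # scaled so the map contracts
U = rng.standard_normal((d, n))
b = np.zeros(d)
z_star = deq_forward(rng.standard_normal(n), W, U, b)
```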
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent [1.7874193862154875]
Momentum stochastic gradient descent uses the accumulated gradient as the update direction of the current parameters.
Plain stochastic gradient descent uses only the current gradient, without correction from the accumulated gradient.
The proposed TSGD algorithm, which transitions from the former to the latter, has faster training speed, higher accuracy, and better stability.
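To make the contrast concrete, the sketch below blends the two update directions with a scale that is annealed from 1 (pure momentum SGD) toward 0 (plain SGD) over training; the linear schedule and blending rule are illustrative assumptions, not the exact TSGD transition.

```python
import numpy as np

def scaling_transition_step(w, grad, buf, t, lr=0.01, momentum=0.9, transition_iters=1000):
    """One update that follows the accumulated (momentum) gradient early on and
    gradually hands over to the plain gradient as training progresses."""
    buf = momentum * buf + grad                       # accumulated gradient
    s = max(0.0, 1.0 - t / transition_iters)          # scale: 1 -> 0 over training
    direction = s * buf + (1.0 - s) * grad            # momentum SGD -> plain SGD
    w = w - lr * direction
    return w, buf

# Example: quadratic objective 0.5 * ||w||^2, whose gradient is w itself.
w, buf = np.array([4.0, -2.0]), np.zeros(2)
for t in range(2000):
    w, buf = scaling_transition_step(w, w, buf, t)
```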
arXiv Detail & Related papers (2021-06-12T11:42:04Z)
- Decreasing scaling transition from adaptive gradient descent to stochastic gradient descent [1.7874193862154875]
We propose DSTAda, a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda has faster speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization [16.02377434191239]
We propose an accumulated decoupled learning (ADL) which incorporates the gradient accumulation technique to mitigate the stale gradient effect.
We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature.
ADL is shown to outperform several state-of-the-art methods on classification tasks, and is the fastest among the compared methods.
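As a minimal illustration of the gradient-accumulation ingredient only (ADL's decoupled, inter-layer-parallel training loop is not reproduced), the sketch below accumulates several micro-batch gradients and applies them as one averaged update, so the weights change less often between the step that produced a gradient and the step that applies it; the class name and averaging choice are assumptions.

```python
import numpy as np

class AccumulatedUpdater:
    """Accumulate gradients over several micro-batches and apply them as a single
    averaged update (illustrative sketch of the accumulation ingredient only)."""

    def __init__(self, w, lr=0.01, accum_steps=4):
        self.w = w
        self.lr = lr
        self.accum_steps = accum_steps
        self._acc = np.zeros_like(w)
        self._count = 0

    def step(self, grad):
        self._acc += grad
        self._count += 1
        if self._count == self.accum_steps:
            # Weights change only once per accumulation window, so gradients
            # computed inside the window are applied to the weights they saw.
            self.w = self.w - self.lr * self._acc / self.accum_steps
            self._acc = np.zeros_like(self.w)
            self._count = 0
        return self.w

# Example usage on a toy quadratic, gradient of ||w||^2 is 2w.
upd = AccumulatedUpdater(np.array([3.0, -1.0]))
for _ in range(100):
    upd.step(2.0 * upd.w)
```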
arXiv Detail & Related papers (2020-12-03T11:52:55Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation-based schemes can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)