Adaptive Braking for Mitigating Gradient Delay
- URL: http://arxiv.org/abs/2007.01397v2
- Date: Fri, 10 Jul 2020 17:12:25 GMT
- Title: Adaptive Braking for Mitigating Gradient Delay
- Authors: Abhinav Venigalla and Atli Kosson and Vitaliy Chiley and Urs Köster
- Abstract summary: We introduce Adaptive Braking, a modification for momentum-based optimizers that mitigates the effects of gradient delay.
We show that applying AB on top of SGD with momentum enables training ResNets on CIFAR-10 and ImageNet-1k under gradient delays with minimal drop in final test accuracy.
- Score: 0.8602553195689513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network training is commonly accelerated by using multiple
synchronized workers to compute gradient updates in parallel. Asynchronous
methods remove synchronization overheads and improve hardware utilization at
the cost of introducing gradient delay, which impedes optimization and can lead
to lower final model performance. We introduce Adaptive Braking (AB), a
modification for momentum-based optimizers that mitigates the effects of
gradient delay. AB dynamically scales the gradient based on the alignment of
the gradient and the velocity. This can dampen oscillations along high
curvature directions of the loss surface, stabilizing and accelerating
asynchronous training. We show that applying AB on top of SGD with momentum
enables training ResNets on CIFAR-10 and ImageNet-1k with delays $D \geq 32$
update steps with minimal drop in final test accuracy.
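As a concrete illustration of the gradient scaling described above, the sketch below applies an Adaptive-Braking-style factor inside a plain NumPy SGD-with-momentum step. The scaling form `1 - ab_coeff * cos(g, v)`, the hyperparameter name `ab_coeff`, and the toy delayed-gradient loop are assumptions made for illustration; see the paper for the exact rule and recommended settings.

```python
import numpy as np
from collections import deque

def ab_sgd_step(param, grad, velocity, lr=0.01, momentum=0.9, ab_coeff=1.0, eps=1e-8):
    """One SGD-with-momentum step with an Adaptive-Braking-style gradient scaling.

    The (possibly delayed) gradient is rescaled by 1 - ab_coeff * cos(grad, velocity):
    when the gradient opposes the velocity the factor exceeds 1 and brakes the update,
    and when they align it shrinks toward 0. The exact rule in the paper may differ;
    this is only a sketch.
    """
    g, v = grad.ravel(), velocity.ravel()
    cos = float(g @ v) / (np.linalg.norm(g) * np.linalg.norm(v) + eps)
    scaled_grad = grad * (1.0 - ab_coeff * cos)
    velocity = momentum * velocity + scaled_grad
    return param - lr * velocity, velocity

# Toy ill-conditioned quadratic f(x) = 0.5 * x^T A x with a simulated delay of D steps.
A = np.diag([1.0, 10.0])
x, v, D = np.array([1.0, 1.0]), np.zeros(2), 4
stale = deque(A @ x for _ in range(D))          # queue of in-flight gradients
for _ in range(300):
    stale.append(A @ x)                         # gradient computed at the current iterate
    x, v = ab_sgd_step(x, stale.popleft(), v)   # optimizer receives a D-step-old gradient
print("final iterate after 300 delayed steps:", np.round(x, 4))
```

The toy loop feeds the optimizer gradients that are D steps stale, which is the regime AB is designed to stabilize.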
Related papers
- Fast and Slow Gradient Approximation for Binary Neural Network Optimization [11.064044986709733]
Hypernetwork-based methods utilize neural networks to learn the gradients of non-differentiable quantization functions.
We propose a Historical Gradient Storage (HGS) module, which models the historical gradient sequence to generate the first-order momentum required for optimization.
We also introduce Layer Recognition Embeddings (LRE) into the hypernetwork, facilitating the generation of layer-specific fine gradients.
arXiv Detail & Related papers (2024-12-16T13:48:40Z)
- Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and throughput differences.
PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads.
Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays.
arXiv Detail & Related papers (2024-10-08T12:32:36Z)
- Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch.
FNGD exhibits similarities to the averaged sum used in first-order methods, making its computational complexity comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z)
- MetaGrad: Adaptive Gradient Quantization with Hypernetworks [46.55625589293897]
Quantization-aware Training (QAT) accelerates the forward pass during neural network training and inference.
In this work, we propose to address gradient quantization noise by incorporating the gradients into the computation graph of the next training iteration via a hypernetwork.
Various experiments on the CIFAR-10 dataset with different CNN architectures demonstrate that our hypernetwork-based approach can effectively reduce the negative effect of gradient quantization noise.
arXiv Detail & Related papers (2023-03-04T07:26:34Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
The deep equilibrium model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings, i.e., between equilibrium-model inference and optimization over the network's inputs.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
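For readers unfamiliar with the setup, here is a minimal generic sketch of what "computing the output by finding the fixed point of a single nonlinear layer" looks like; the tanh layer, its parameters, and the naive fixed-point solver are illustrative stand-ins rather than the formulation used in the paper (real implementations use faster root-finders and implicit differentiation).

```python
import numpy as np

def equilibrium_layer(x, W, U, b, n_iters=100, tol=1e-6):
    """Solve z = tanh(W @ z + U @ x + b) for its fixed point by naive iteration."""
    z = np.zeros(W.shape[0])
    for _ in range(n_iters):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:     # converged to the equilibrium
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W = 0.3 * rng.standard_normal((d_hid, d_hid)) / np.sqrt(d_hid)  # small norm so the map contracts
U = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
b = np.zeros(d_hid)
z_star = equilibrium_layer(rng.standard_normal(d_in), W, U, b)
print("equilibrium activations:", np.round(z_star, 3))
```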
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs competitively on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent [1.7874193862154875]
Momentum stochastic gradient descent uses the accumulated gradient as the update direction for the current parameters.
Plain stochastic gradient descent instead uses the current gradient, without correction by the accumulated gradient.
The proposed TSGD algorithm, which scales the transition from momentum SGD to plain SGD over the course of training, achieves faster training, higher accuracy, and better stability.
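As a rough sketch of what such a scaling transition could look like, the snippet below blends the momentum (accumulated-gradient) direction with the plain gradient using a coefficient that decays over training; the linear schedule and the blending rule are placeholders, not the paper's actual scheme.

```python
import numpy as np

def tsgd_like_step(param, grad, velocity, step, total_steps, lr=0.1, momentum=0.9):
    """One step that interpolates between momentum SGD and plain SGD.

    Early in training the update is dominated by the momentum (accumulated
    gradient) direction; late in training it approaches the plain gradient.
    The linear schedule below is a placeholder, not the paper's scheme.
    """
    alpha = max(0.0, 1.0 - step / total_steps)          # 1 -> momentum SGD, 0 -> plain SGD
    velocity = momentum * velocity + grad                # accumulated-gradient direction
    update = alpha * velocity + (1.0 - alpha) * grad     # blend the two update directions
    return param - lr * update, velocity
```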
arXiv Detail & Related papers (2021-06-12T11:42:04Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
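To illustrate the flavor of that heuristic, the toy sketch below picks per-layer scale factors for a tiny two-layer network by grid search so that the loss after one plain SGD step is as small as possible; the model, loss, and grid search are placeholders (the actual method optimizes the scales directly rather than grid-searching them).

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_after_one_sgd_step(scales, weights, x, y, lr=0.1):
    """Loss of a tiny 2-layer tanh network after one SGD step, given per-layer scales."""
    W1, W2 = scales[0] * weights[0], scales[1] * weights[1]
    # Forward pass.
    h = np.tanh(W1 @ x)
    err = W2 @ h - y
    # Manual gradients of the squared loss 0.5 * ||W2 tanh(W1 x) - y||^2.
    gW2 = np.outer(err, h)
    gW1 = np.outer((W2.T @ err) * (1 - h ** 2), x)
    # One SGD step, then re-evaluate the loss.
    W1n, W2n = W1 - lr * gW1, W2 - lr * gW2
    pred_n = W2n @ np.tanh(W1n @ x)
    return 0.5 * float(np.sum((pred_n - y) ** 2))

weights = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]
x, y = rng.standard_normal(8), rng.standard_normal(4)

# Crude grid search over per-layer scale factors.
candidates = [0.1, 0.3, 1.0, 3.0]
best = min(((s1, s2) for s1 in candidates for s2 in candidates),
           key=lambda s: loss_after_one_sgd_step(s, weights, x, y))
print("chosen per-layer scales:", best)
```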
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization [16.02377434191239]
We propose accumulated decoupled learning (ADL), which incorporates the gradient accumulation technique to mitigate the stale gradient effect.
We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature.
ADL is shown to outperform several state-of-the-art methods on classification tasks, and is the fastest among the compared methods.
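For intuition about the gradient-accumulation ingredient, the sketch below simply averages gradients from several micro-batches before applying one update; the decoupled, layer-parallel machinery of ADL itself is not modeled here, and the helper names are made up for illustration.

```python
import numpy as np

def accumulated_update(param, accumulated_grads, lr=0.1):
    """Apply one parameter update from gradients accumulated over K micro-batches.

    Averaging the K gradients before the update is the standard accumulation
    trick; when the gradients arrive asynchronously with different staleness,
    the average also dilutes the contribution of any single stale gradient.
    """
    return param - lr * np.mean(accumulated_grads, axis=0)

rng = np.random.default_rng(0)
param = rng.standard_normal(10)
K = 4
# Hypothetical noisy gradients of the loss 0.5 * ||param||^2 from K micro-batches.
grads = [param + 0.1 * rng.standard_normal(10) for _ in range(K)]
param = accumulated_update(param, grads)
```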
arXiv Detail & Related papers (2020-12-03T11:52:55Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of such extrapolation variants can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)