Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized
Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2112.01433v1
- Date: Thu, 2 Dec 2021 17:23:25 GMT
- Title: Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized
Stochastic Gradient Descent
- Authors: Wei Zhang, Mingrui Liu, Yu Feng, Xiaodong Cui, Brian Kingsbury, Yuhai
Tu
- Abstract summary: Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training.
In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates.
Recently, Decentralized Parallel SGD (DPSGD) has been proposed to improve training speed.
- Score: 37.52828820578212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed Deep Learning (DDL) is essential for large-scale Deep Learning
(DL) training. Synchronous Stochastic Gradient Descent (SSGD) is the de facto
DDL optimization method. Using a sufficiently large batch size is critical to
achieving DDL runtime speedup. In a large batch setting, the learning rate must
be increased to compensate for the reduced number of parameter updates.
However, a large learning rate may harm convergence in SSGD and training could
easily diverge. Recently, Decentralized Parallel SGD (DPSGD) has been proposed
to improve distributed training speed. In this paper, we find that DPSGD not
only has a system-wise run-time benefit but also a significant convergence
benefit over SSGD in the large batch setting. Based on a detailed analysis of
the DPSGD learning dynamics, we find that DPSGD introduces additional
landscape-dependent noise that automatically adjusts the effective learning
rate to improve convergence. In addition, we theoretically show that this noise
smoothes the loss landscape, hence allowing a larger learning rate. We conduct
extensive studies over 18 state-of-the-art DL models/tasks and demonstrate that
DPSGD often converges in cases where SSGD diverges for large learning rates in
the large batch setting. Our findings are consistent across two different
application domains: Computer Vision (CIFAR10 and ImageNet-1K) and Automatic
Speech Recognition (SWB300 and SWB2000), and two different types of neural
network models: Convolutional Neural Networks and Long Short-Term Memory
Recurrent Neural Networks.
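To make the contrast concrete, below is a minimal NumPy sketch of the two update rules: SSGD, where every worker applies the same all-reduced (averaged) gradient, versus DPSGD, where each worker gossip-averages parameters with its ring neighbors and then takes its own local stochastic gradient step. The toy quadratic loss, worker count, and ring mixing matrix are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, steps = 4, 8, 0.1, 50
A = rng.standard_normal((dim, dim)); A = A.T @ A / dim      # toy convex loss 0.5 * w' A w

def grad(w):
    # stochastic gradient of the toy loss; the additive noise stands in for minibatch noise
    return A @ w + 0.01 * rng.standard_normal(dim)

# symmetric ring mixing matrix: each worker averages itself with its two neighbors
mix = np.zeros((n_workers, n_workers))
for i in range(n_workers):
    for j in (i - 1, i, i + 1):
        mix[i, j % n_workers] = 1.0 / 3.0

w_ssgd = np.ones(dim)                 # SSGD keeps a single, globally synchronized model
W = np.ones((n_workers, dim))         # DPSGD keeps one model replica per worker

for _ in range(steps):
    # SSGD: every worker applies the same all-reduced (averaged) gradient
    w_ssgd = w_ssgd - lr * np.mean([grad(w_ssgd) for _ in range(n_workers)], axis=0)
    # DPSGD: gossip-average parameters with ring neighbors, then take a local gradient step
    W = mix @ W
    W = W - lr * np.array([grad(W[i]) for i in range(n_workers)])

print("SSGD loss :", 0.5 * w_ssgd @ A @ w_ssgd)
print("DPSGD loss:", np.mean([0.5 * W[i] @ A @ W[i] for i in range(n_workers)]))
```

The disagreement between the DPSGD replicas is the source of the additional landscape-dependent noise that, per the paper's analysis, adjusts the effective learning rate and smoothes the loss landscape.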
Related papers
- Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks [18.142378139047977]
This paper proposes a fractional-order spike-timing-dependent gradient descent (FOSTDGD) learning model.
It is tested on the MNIST and DVS128 Gesture datasets, and its accuracy under different network structures and fractional orders is analyzed.
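The summary does not spell out the update rule; purely as background, the sketch below shows one common Caputo-style fractional-order gradient step on a toy one-dimensional quadratic, where the ordinary gradient is scaled by a power-law factor in the distance from a reference point. The fractional order, reference point, and loss are assumptions, and this is a generic fractional-order update rather than FOSTDGD's spike-timing-dependent rule.

```python
import numpy as np
from math import gamma

alpha, lr, steps = 0.9, 0.1, 50        # fractional order in (0, 1), step size, iterations
w, w0 = 5.0, 0.0                       # parameter and lower terminal of the Caputo derivative

def loss_grad(w):
    return w                           # gradient of the toy loss f(w) = 0.5 * w**2

for _ in range(steps):
    # leading-term Caputo approximation: f'(w) * |w - w0|**(1 - alpha) / Gamma(2 - alpha)
    frac_grad = loss_grad(w) * abs(w - w0) ** (1.0 - alpha) / gamma(2.0 - alpha)
    w -= lr * frac_grad

print("final w:", w)
```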
arXiv Detail & Related papers (2024-10-20T05:31:34Z)
- Analyzing and Improving the Training Dynamics of Diffusion Models [36.37845647984578]
We identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture.
We find that systematic application of this magnitude-preserving design philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity.
arXiv Detail & Related papers (2023-12-05T11:55:47Z)
- Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
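As a rough illustration of the diffusion-spectral-entropy idea, see the sketch below: build a diffusion operator from pairwise affinities and take the Shannon entropy of its eigenvalue distribution, which tends to be lower when the data lies on a low-dimensional manifold. The Gaussian kernel, median bandwidth, row normalization, and diffusion time follow common diffusion-map practice and are assumptions, not necessarily the paper's exact construction.

```python
import numpy as np

def diffusion_spectral_entropy(X, t=2):
    """Entropy of the eigenvalue distribution of a diffusion operator built on X (generic sketch)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
    sigma2 = np.median(d2)                                  # median heuristic for the kernel bandwidth
    K = np.exp(-d2 / sigma2)                                # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)                    # row-normalize -> diffusion operator
    lam = np.abs(np.linalg.eigvals(P)) ** t                 # eigenvalue magnitudes at diffusion time t
    p = lam / lam.sum()
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
flat = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 20))   # ~2-D manifold embedded in 20-D
iso = rng.standard_normal((200, 20))                                   # full-rank 20-D Gaussian cloud
print("low intrinsic dim :", diffusion_spectral_entropy(flat))
print("high intrinsic dim:", diffusion_spectral_entropy(iso))
```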
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINNs tend to suffer training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
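The implicit update solves w_{t+1} = w_t - eta * grad f(w_{t+1}) rather than evaluating the gradient at the current iterate. A minimal sketch on a stiff one-dimensional quadratic (an assumed toy problem, not the paper's PINN setting) shows why this stabilizes training at step sizes where the explicit update diverges.

```python
import numpy as np

L = 4.0
def grad(w):
    return L * w                          # gradient of the stiff toy loss f(w) = 0.5 * L * w**2

lr = 0.6                                  # explicit SGD diverges here because lr * L > 2
w_exp = w_imp = 1.0

for _ in range(20):
    # explicit update: w <- w - lr * grad(w); the factor |1 - lr*L| = 1.4 > 1, so it blows up
    w_exp = w_exp - lr * grad(w_exp)

    # implicit update: solve x = w_imp - lr * grad(x) with a few Newton steps on
    # r(x) = x - w_imp + lr * grad(x); for a linear gradient this converges immediately
    x = w_imp
    for _ in range(5):
        r, dr = x - w_imp + lr * grad(x), 1.0 + lr * L
        x = x - r / dr
    w_imp = x

print("explicit SGD:", w_exp)             # grows in magnitude (unstable)
print("implicit SGD:", w_imp)             # decays stably toward the minimum at 0
```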
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust minimization problem and solved efficiently.
We show that our proposed algorithm can improve the worst-case distribution test accuracy by up to 10%.
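For intuition, the KL-regularized inner maximization has a closed form in which each client's loss is exponentially reweighted, so each step reduces to a softmax-weighted gradient update. The sketch below illustrates only that reweighting on a toy centralized problem; the decentralized gossip updates and constants of DR-DSGD are not reproduced, and the temperature and per-client loss are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim, lam, lr = 5, 3, 0.5, 0.1
targets = rng.standard_normal((n_clients, dim))            # each client has a different data mean
w = np.zeros(dim)

def client_loss_and_grad(w, target):
    diff = w - target
    return 0.5 * diff @ diff, diff                          # simple quadratic per-client loss

for _ in range(100):
    losses, grads = zip(*(client_loss_and_grad(w, t) for t in targets))
    losses, grads = np.array(losses), np.array(grads)
    # KL-regularized robust weights: clients with larger loss get exponentially more weight;
    # large lam recovers plain averaging, small lam approaches worst-case (min-max) training
    p = np.exp((losses - losses.max()) / lam)
    p = p / p.sum()
    w = w - lr * (p[:, None] * grads).sum(axis=0)

print("worst client loss:", max(client_loss_and_grad(w, t)[0] for t in targets))
```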
arXiv Detail & Related papers (2022-08-29T18:01:42Z)
- Distribution-sensitive Information Retention for Accurate Binary Neural Network [49.971345958676196]
We present a novel Distribution-sensitive Information Retention Network (DIR-Net) to retain the information of the forward activations and backward gradients.
Our DIR-Net consistently outperforms the SOTA binarization approaches under mainstream and compact architectures.
We deploy DIR-Net on real-world resource-limited devices, achieving an 11.1x storage saving and a 5.4x speedup.
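The summary does not describe DIR-Net's retention modules; for context, the sketch below shows the generic 1-bit quantization pattern such binarization approaches build on: sign() in the forward pass with a straight-through estimator (clipped identity) in the backward pass. The tensor shapes and clipping threshold are assumptions.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass; straight-through estimator (clipped identity) in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()   # pass gradients only where |x| <= 1

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 3, requires_grad=True)
out = BinarizeSTE.apply(x) @ BinarizeSTE.apply(w)  # 1-bit activations and weights
out.sum().backward()                               # gradients flow via the straight-through path
print(x.grad.shape, w.grad.shape)
```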
arXiv Detail & Related papers (2021-09-25T10:59:39Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/backward propagation while gradients are aggregated across workers.
DaSGD parallelizes SGD and forward/backward propagation to hide 100% of the communication overhead.
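A single-process simulation of the delayed-averaging idea is sketched below: each worker takes its local gradient step immediately and applies the global average computed in the previous iteration, standing in for communication that overlaps with computation. The toy loss, full (topology-free) averaging, and one-step delay buffer are assumptions, not DaSGD's exact update.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, steps = 4, 8, 0.05, 100
A = rng.standard_normal((dim, dim)); A = A.T @ A / dim      # toy quadratic loss 0.5 * w' A w
W = np.ones((n_workers, dim))                                # per-worker model replicas
pending = np.zeros((n_workers, dim))                         # averaging correction still "in flight"

def grad(w):
    return A @ w + 0.01 * rng.standard_normal(dim)

for _ in range(steps):
    # each worker takes its local SGD step immediately (no wait for communication)...
    W = W - lr * np.array([grad(W[i]) for i in range(n_workers)])
    # ...and applies the averaging correction from the previous step, which by now has
    # finished communicating in the background (simulated by the one-step-old buffer)
    W = W + pending
    pending = np.mean(W, axis=0, keepdims=True) - W          # correction toward the current average,
                                                             # delivered at the next iteration

print("replica spread:", np.abs(W - W.mean(axis=0)).max())
print("mean loss:", np.mean([0.5 * W[i] @ A @ W[i] for i in range(n_workers)]))
```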
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
- OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training [5.888925582071453]
We propose a novel technique named One-step Delay SGD (OD-SGD) that combines the strengths of synchronous and asynchronous SGD in the training process.
We evaluate our proposed algorithm on MNIST, CIFAR-10 and ImageNet datasets.
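Only the one-step-delay mechanism is sketched below: workers compute gradients on the current weights while the global model is updated with the aggregate from the previous iteration, so computation and the global update can overlap. OD-SGD's additional local-update/global-update split is not reproduced, and the toy loss is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers, dim, lr, steps = 4, 8, 0.05, 100
A = rng.standard_normal((dim, dim)); A = A.T @ A / dim       # toy quadratic loss 0.5 * w' A w
w_global = np.ones(dim)
delayed_grads = None                                          # gradients held over from the previous step

def grad(w):
    return A @ w + 0.01 * rng.standard_normal(dim)

for _ in range(steps):
    # workers compute gradients on the current weights; meanwhile the server applies the
    # aggregate from the previous step, so gradient computation and the global update overlap
    fresh = np.array([grad(w_global) for _ in range(n_workers)])
    if delayed_grads is not None:
        w_global = w_global - lr * delayed_grads.mean(axis=0)  # one-step-stale global update
    delayed_grads = fresh

print("loss:", 0.5 * w_global @ A @ w_global)
```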
arXiv Detail & Related papers (2020-05-14T05:33:36Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows a better convergence bound than error feedback for nonconvex problems.
We also propose DEFA to accelerate the generalization of DEF, with better generalization bounds than DEF.
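As background, the sketch below shows a generic error-feedback loop with a random-mask sparsifier: the coordinates dropped by compression are remembered locally and added back into the next message, compensating for the compression bias. This illustrates plain error feedback, not DEF's detached variant; the keep probability, step size, and toy loss are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr, keep_prob, steps = 20, 0.1, 0.5, 300
w = np.ones(dim)
err = np.zeros(dim)                                    # memory of what sparsification dropped so far

def grad(w):
    return w + 0.01 * rng.standard_normal(dim)         # gradient of 0.5 * ||w||^2 plus noise

for _ in range(steps):
    m = grad(w) + err                                  # add back the previously dropped mass
    mask = rng.random(dim) < keep_prob                 # random sparsification: transmit ~50% of coords
    compressed = m * mask                              # what would actually be communicated
    err = m - compressed                               # error feedback: remember what was dropped
    w = w - lr * compressed

print("final loss:", 0.5 * w @ w)
```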
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
- Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent [32.40217829362088]
We propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training deep neural networks (DNNs).
SRSGD replaces the constant momentum in SGD with the increasing momentum of NAG, but stabilizes the iterations by resetting the momentum to zero according to a schedule.
On both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.
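A minimal sketch of the scheduled-restart idea is below: the momentum coefficient grows with a NAG-style t/(t+3) schedule and is reset to zero every fixed number of iterations. The restart period, toy quadratic loss, and noise level are assumptions, not the paper's tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr, restart_every, steps = 10, 0.1, 40, 200
A = rng.standard_normal((dim, dim)); A = A.T @ A / dim + 0.1 * np.eye(dim)  # toy quadratic loss

def grad(w):
    return A @ w + 0.01 * rng.standard_normal(dim)

w = np.ones(dim)
v_prev = w.copy()
for k in range(steps):
    t = k % restart_every                      # iterations since the last scheduled restart
    mu = t / (t + 3.0)                         # NAG-style increasing momentum, reset to 0 at restart
    v = w - lr * grad(w)                       # gradient step
    w = v + mu * (v - v_prev)                  # momentum (look-ahead) step
    v_prev = v

print("final loss:", 0.5 * w @ A @ w)
```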
arXiv Detail & Related papers (2020-02-24T23:16:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.