Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2002.10583v2
- Date: Sun, 26 Apr 2020 11:55:17 GMT
- Title: Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
- Authors: Bao Wang, Tan M. Nguyen, Andrea L. Bertozzi, Richard G. Baraniuk,
Stanley J. Osher
- Abstract summary: We propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training deep neural networks (DNNs).
SRSGD replaces the constant momentum in SGD with the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule.
On both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.
- Score: 32.40217829362088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) with constant momentum and its variants
such as Adam are the optimization algorithms of choice for training deep neural
networks (DNNs). Since DNN training is incredibly computationally expensive,
there is great interest in speeding up the convergence. Nesterov accelerated
gradient (NAG) improves the convergence rate of gradient descent (GD) for
convex optimization using a specially designed momentum; however, it
accumulates error when an inexact gradient is used (such as in SGD), slowing
convergence at best and diverging at worst. In this paper, we propose Scheduled
Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces
the constant momentum in SGD by the increasing momentum in NAG but stabilizes
the iterations by resetting the momentum to zero according to a schedule. Using
a variety of models and benchmarks for image classification, we demonstrate
that, in training DNNs, SRSGD significantly improves convergence and
generalization; for instance in training ResNet200 for ImageNet classification,
SRSGD achieves an error rate of 20.93% vs. the benchmark of 22.13%. These
improvements become more significant as the network grows deeper. Furthermore,
on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates
with significantly fewer training epochs compared to the SGD baseline.
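The recipe in the abstract is compact enough to sketch as an optimizer: take SGD, use the increasing NAG momentum sequence instead of a constant coefficient, and periodically reset that sequence so the momentum drops back to zero. The sketch below is a minimal illustration of this description only; the class name, the FISTA-style momentum sequence, and the fixed restart frequency are illustrative assumptions, not the paper's exact algorithm or restart schedule.

```python
# Minimal sketch of scheduled-restart Nesterov momentum for SGD.
# All hyperparameter values are placeholders, not the paper's settings.
import torch


class SRSGDSketch:
    """SGD with increasing NAG momentum, reset to zero on a schedule."""

    def __init__(self, params, lr=0.1, restart_freq=40):
        self.params = list(params)
        self.lr = lr
        self.restart_freq = restart_freq  # iterations between momentum resets (placeholder)
        self.t = 1.0                      # NAG momentum sequence; reset on restart
        self.num_steps = 0
        # Previous gradient-step iterate x_k for each parameter (initially x_0 = theta_0).
        self.prev = [p.detach().clone() for p in self.params]

    @torch.no_grad()
    def step(self):
        # Increasing NAG momentum coefficient mu_k = (t_k - 1) / t_{k+1}.
        t_next = 0.5 * (1.0 + (1.0 + 4.0 * self.t ** 2) ** 0.5)
        mu = (self.t - 1.0) / t_next
        for p, x_prev in zip(self.params, self.prev):
            if p.grad is None:
                continue
            x_new = p - self.lr * p.grad            # gradient step from the current iterate
            p.copy_(x_new + mu * (x_new - x_prev))  # NAG extrapolation with increasing momentum
            x_prev.copy_(x_new)
        self.t = t_next
        self.num_steps += 1
        # Scheduled restart: momentum returns to zero, curbing error accumulation from noisy gradients.
        if self.num_steps % self.restart_freq == 0:
            self.t = 1.0

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```

In use, gradients would be produced by loss.backward() before step(), as with torch.optim.SGD; how the restart frequency is varied over the course of training comes from the paper's experiments and is not modeled here.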
Related papers
- Membrane Potential Distribution Adjustment and Parametric Surrogate
Gradient in Spiking Neural Networks [3.485537704990941]
The surrogate gradient (SG) strategy is investigated and applied to circumvent the non-differentiability of spiking neurons and train SNNs from scratch.
We propose the parametric surrogate gradient (PSG) method to iteratively update SG and eventually determine an optimal surrogate gradient parameter.
Experimental results demonstrate that the proposed methods can be readily integrated with the backpropagation through time (BPTT) algorithm.
arXiv Detail & Related papers (2023-04-26T05:02:41Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been effectively demonstrated in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Online Training Through Time for Spiking Neural Networks [66.7744060103562]
Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models.
Recent progress in training methods has enabled successful deep SNNs on large-scale tasks with low latency.
We propose online training through time (OTTT) for SNNs, which is derived from BPTT to enable forward-in-time learning.
arXiv Detail & Related papers (2022-10-09T07:47:56Z)
- Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models [158.19276683455254]
Adaptive gradient algorithms borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence.
Nesterov acceleration converges faster than heavy-ball acceleration in theory and in many empirical cases.
In this paper, we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point (see the sketch after this entry).
We show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks.
arXiv Detail & Related papers (2022-08-13T16:04:39Z)
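Since the entry above turns on how NME avoids evaluating gradients at an extrapolated point, a rough sketch may help. The function below illustrates only the general gradient-difference idea: approximate the look-ahead gradient from the current gradient and its recent change. The coefficient names and this exact form are assumptions for illustration and are not Adan's actual update rule.

```python
# Rough sketch of the gradient-difference idea behind Nesterov momentum
# estimation (NME). Coefficients and this exact form are assumptions, not
# the Adan paper's update rule.
import torch


def nme_momentum_step(m_prev: torch.Tensor,
                      g: torch.Tensor,
                      g_prev: torch.Tensor,
                      beta: float = 0.9) -> torch.Tensor:
    """Update a momentum buffer using a gradient-difference correction."""
    # Surrogate for the gradient at the Nesterov look-ahead point:
    # current gradient plus a term proportional to its recent change,
    # so no gradient is evaluated at an extrapolated point.
    g_lookahead = g + beta * (g - g_prev)
    # Standard exponential moving average with the surrogate gradient.
    return beta * m_prev + (1.0 - beta) * g_lookahead
```

In a full optimizer, such a buffer would feed the first- and second-order moment estimates mentioned above; those details are left to the paper.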
- Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient AI models when implemented on neuromorphic hardware.
It is a challenge to efficiently train SNNs due to their non-differentiability.
We propose the Differentiation on Spike Representation (DSR) method, which can achieve high performance.
arXiv Detail & Related papers (2022-05-01T12:44:49Z)
- Temporal Efficient Training of Spiking Neural Network via Gradient
Re-weighting [29.685909045226847]
Brain-inspired spiking neural networks (SNNs) have attracted widespread research interest because of their event-driven and energy-efficient characteristics.
Current direct training approach with surrogate gradient results in SNNs with poor generalizability.
We introduce the temporal efficient training (TET) approach to compensate for the loss of momentum in the gradient descent with SG.
arXiv Detail & Related papers (2022-02-24T08:02:37Z)
- Guided parallelized stochastic gradient descent for delay compensation [0.0]
The stochastic gradient descent (SGD) algorithm and its variations have been effectively used to optimize neural network models.
With the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its natural behavior of sequential optimization of the error function.
This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD) to train deep neural networks.
arXiv Detail & Related papers (2021-01-17T23:12:40Z)
- Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style
Adaptive Momentum [9.843647947055745]
In deep learning practice, momentum is usually weighted by a well-calibrated constant.
We propose a novel adaptive momentum for improving DNN training.
arXiv Detail & Related papers (2020-12-03T18:59:43Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed
Training [5.888925582071453]
We propose a novel technique named One-step Delay SGD (OD-SGD) that combines the strengths of synchronous and asynchronous SGD in the training process.
We evaluate our proposed algorithm on MNIST, CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2020-05-14T05:33:36Z)
- Gradient Centralization: A New Optimization Technique for Deep Neural
Networks [74.935141515523]
Gradient centralization (GC) operates directly on gradients by centralizing the gradient vectors to have zero mean.
GC can be viewed as a projected gradient descent method with a constrained loss function.
GC is very simple to implement and can be easily embedded into existing gradient-based DNN optimizers with only one line of code, as sketched below.
arXiv Detail & Related papers (2020-04-03T10:25:00Z)
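The entry above describes GC concretely enough to sketch: center each weight gradient to zero mean before the optimizer update. The snippet below is a minimal illustration; the function name and the choice to average over all dimensions except the output-channel dimension are assumptions for this sketch, not the paper's reference implementation.

```python
# Minimal sketch of gradient centralization (GC): subtract the per-filter mean
# from each weight gradient before the optimizer step. Names and the dimension
# convention are illustrative assumptions, not the paper's code.
import torch


def centralize_gradients(model: torch.nn.Module) -> None:
    """Center the gradient of every weight tensor with more than one dimension."""
    for p in model.parameters():
        if p.grad is None or p.grad.dim() <= 1:
            continue  # skip biases and other 1-D parameters
        # Mean over all dims except the output-channel dim (dim 0), kept per filter.
        mean = p.grad.mean(dim=tuple(range(1, p.grad.dim())), keepdim=True)
        p.grad.sub_(mean)  # the "one line" that centralizes the gradient


# Usage sketch: call between backward() and the optimizer step, e.g.
#   loss.backward(); centralize_gradients(model); optimizer.step()
```

Calling it between loss.backward() and optimizer.step() reflects the "embedded into existing optimizers" claim without modifying the optimizer itself.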