Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style
Adaptive Momentum
- URL: http://arxiv.org/abs/2012.02188v1
- Date: Thu, 3 Dec 2020 18:59:43 GMT
- Title: Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style
Adaptive Momentum
- Authors: Bao Wang, Qiang Ye
- Abstract summary: In deep learning practice, momentum is usually weighted by a well-calibrated constant.
We propose a novel adaptive momentum for improving DNN training.
- Score: 9.843647947055745
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Momentum plays a crucial role in stochastic gradient-based optimization
algorithms for accelerating or improving the training of deep neural networks (DNNs).
In deep learning practice, the momentum is usually weighted by a
well-calibrated constant. However, tuning hyperparameters for momentum can be a
significant computational burden. In this paper, we propose a novel
\emph{adaptive momentum} for improving DNN training; this adaptive momentum,
which requires no momentum-related hyperparameter, is motivated by the nonlinear
conjugate gradient (NCG) method. Stochastic gradient descent (SGD) with this
new adaptive momentum eliminates the need for the momentum hyperparameter
calibration, allows a significantly larger learning rate, accelerates DNN
training, and improves final accuracy and robustness of the trained DNNs. For
instance, SGD with this adaptive momentum reduces classification errors for
training ResNet110 on CIFAR10 and CIFAR100 from $5.25\%$ to $4.64\%$ and
$23.75\%$ to $20.03\%$, respectively. Furthermore, SGD with the new adaptive
momentum also benefits adversarial training and improves adversarial robustness
of the trained DNNs.
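The abstract does not spell out the update rule, so the sketch below is only a rough illustration of the idea: the constant momentum weight in SGD is replaced by a nonlinear-conjugate-gradient-style coefficient computed from successive stochastic gradients. The Polak-Ribiere-plus formula, the fixed step size, the function names, and the toy noisy quadratic are all assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def sgd_ncg_momentum(grad_fn, w, lr=0.05, n_steps=200, eps=1e-12):
    """SGD whose momentum weight is set each step by a Polak-Ribiere-style
    ratio of successive (stochastic) gradients instead of a tuned constant.
    Illustrative sketch only, not the paper's exact rule."""
    g_prev = grad_fn(w)
    d = -g_prev                      # initial direction = steepest descent
    for _ in range(n_steps):
        w = w + lr * d               # step along the current direction
        g = grad_fn(w)               # new (stochastic) gradient
        # Polak-Ribiere-plus coefficient, clipped at 0 (automatic restart)
        beta = max(0.0, g @ (g - g_prev) / (g_prev @ g_prev + eps))
        d = -g + beta * d            # NCG-style momentum/direction update
        g_prev = g
    return w

# Toy usage: minimize a noisy quadratic 0.5 * w^T A w
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
noisy_grad = lambda w: A @ w + 0.01 * rng.standard_normal(2)
print(sgd_ncg_momentum(noisy_grad, w=np.array([5.0, -3.0])))
```

In this sketch the adaptive coefficient plays the role of the momentum weight, so no momentum hyperparameter needs to be tuned; only the learning rate remains.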
Related papers
- The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been effectively demonstrated in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - SPIDE: A Purely Spike-based Method for Training Feedback Spiking Neural
Networks [56.35403810762512]
Spiking neural networks (SNNs) with event-based computation are promising brain-inspired models for energy-efficient applications on neuromorphic hardware.
We study spike-based implicit differentiation on the equilibrium state (SPIDE), which extends a recently proposed training method.
arXiv Detail & Related papers (2023-02-01T04:22:59Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study the training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training without sacrificing generalization.
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Online Training Through Time for Spiking Neural Networks [66.7744060103562]
Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models.
Recent progress in training methods has enabled successful deep SNNs on large-scale tasks with low latency.
We propose online training through time (OTTT) for SNNs, which is derived from BPTT to enable forward-in-time learning.
arXiv Detail & Related papers (2022-10-09T07:47:56Z) - Temporal Efficient Training of Spiking Neural Network via Gradient
Re-weighting [29.685909045226847]
Brain-inspired spiking neuron networks (SNNs) have attracted widespread research interest because of their event-driven and energy-efficient characteristics.
The current direct training approach with surrogate gradients (SG) results in SNNs with poor generalizability.
We introduce the temporal efficient training (TET) approach to compensate for the loss of momentum in the gradient descent with SG.
arXiv Detail & Related papers (2022-02-24T08:02:37Z) - Training Deep Neural Networks with Adaptive Momentum Inspired by the
Quadratic Optimization [20.782428252187024]
We propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for quadratic optimization.
Our proposed adaptive heavy ball momentum can improve stochastic gradient descent (SGD) and Adam.
We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation.
arXiv Detail & Related papers (2021-10-18T07:03:48Z) - Low-Precision Training in Logarithmic Number System using Multiplicative
Weight Update [49.948082497688404]
Training large-scale deep neural networks (DNNs) currently requires a significant amount of energy, leading to serious environmental impacts.
One promising approach to reduce the energy costs is representing DNNs with low-precision numbers.
We jointly design a low-precision training framework involving a logarithmic number system (LNS) and a multiplicative weight update training method, termed LNS-Madam.
arXiv Detail & Related papers (2021-06-26T00:32:17Z) - FracTrain: Fractionally Squeezing Bit Savings Both Temporally and
Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients during training.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better accuracy (-0.12% to +1.87%).
arXiv Detail & Related papers (2020-12-24T05:24:10Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter has changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning-rate-based algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent [32.40217829362088]
We propose scheduled restart SGD (SRSGD), a new NAG-style scheme for training deep neural networks (DNNs).
SRSGD replaces the constant momentum in SGD with the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule (a rough sketch follows this list).
On both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.
arXiv Detail & Related papers (2020-02-24T23:16:19Z)
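The SRSGD entry above describes a concrete mechanism: an increasing Nesterov-style momentum that is periodically reset. The minimal sketch below assumes a fixed restart period and the standard Nesterov sequence for the momentum weight; the paper's actual restart schedule and hyperparameters may differ, and the toy problem and names are illustrative.

```python
import numpy as np

def srsgd_style(grad_fn, w, lr=0.05, n_steps=200, restart_every=40):
    """Sketch of scheduled-restart momentum (SRSGD-like): the momentum
    weight follows the increasing Nesterov sequence and is reset every
    `restart_every` iterations.  Constants and schedule are illustrative."""
    x_prev = w.copy()                # previous iterate (for the extrapolation)
    t = 1.0                          # Nesterov sequence parameter
    for k in range(n_steps):
        if k % restart_every == 0:   # scheduled restart: drop accumulated momentum
            t, x_prev = 1.0, w.copy()
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        mu = (t - 1.0) / t_next      # increasing NAG-style momentum weight
        x_new = w - lr * grad_fn(w)  # gradient step from the extrapolated point
        w = x_new + mu * (x_new - x_prev)   # Nesterov extrapolation
        x_prev, t = x_new, t_next
    return x_prev

# Toy usage: minimize a noisy quadratic 0.5 * w^T A w
rng = np.random.default_rng(1)
A = np.diag([1.0, 10.0])
noisy_grad = lambda w: A @ w + 0.01 * rng.standard_normal(2)
print(srsgd_style(noisy_grad, w=np.array([5.0, -3.0])))
```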