Adaptive Gradient Method with Resilience and Momentum
- URL: http://arxiv.org/abs/2010.11041v1
- Date: Wed, 21 Oct 2020 14:49:00 GMT
- Title: Adaptive Gradient Method with Resilience and Momentum
- Authors: Jie Liu, Chen Lin, Chuming Li, Lu Sheng, Ming Sun, Junjie Yan, Wanli
Ouyang
- Abstract summary: We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
- Score: 120.83046824742455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several variants of stochastic gradient descent (SGD) have been proposed to
improve the learning effectiveness and efficiency when training deep neural
networks; among them, some recent influential attempts adaptively control the
parameter-wise learning rate (e.g., Adam and RMSProp). Although they show a
large improvement in convergence speed, most adaptive learning rate methods
suffer from compromised generalization compared with SGD. In this paper, we
propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem),
motivated by the observation that oscillations of network parameters slow the
training, and we give a theoretical proof of convergence. For each parameter,
AdaRem adjusts the parameter-wise learning rate according to whether the
direction in which that parameter changed in the past is aligned with the
direction of the current gradient, and thus encourages long-term consistent
parameter updating with far fewer oscillations. Comprehensive experiments on a
large-scale image recognition dataset, ImageNet, verify the effectiveness of
AdaRem when training various models, and also demonstrate that our method
outperforms previous adaptive learning rate-based algorithms in terms of both
training speed and test error.
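The abstract describes the update rule only at a high level. Below is a minimal numpy sketch of one plausible reading of it, in which each parameter's learning rate is scaled by the agreement between the current gradient and an exponential moving average of its past gradients; the function name, the moving-average form, and the exact scaling are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def adarem_like_step(w, grad, d, base_lr=0.1, beta=0.9, eps=1e-8):
    """One update in the spirit of AdaRem as read from the abstract
    (illustrative sketch, not the paper's exact algorithm)."""
    # d: exponential moving average of past gradients, i.e. the smoothed
    # direction in which each parameter has been moving.
    d = beta * d + (1 - beta) * grad
    # Per-parameter alignment in [-1, 1]: +1 when the current gradient points
    # the same way the parameter has been moving, -1 when it opposes it.
    align = (d * grad) / (np.abs(d) * np.abs(grad) + eps)
    # Enlarge the step for consistently moving parameters and shrink it for
    # oscillating ones, encouraging long-term consistent updates.
    lr = base_lr * (1.0 + align) / 2.0
    return w - lr * grad, d

# Toy usage on f(w) = ||w||^2: consistently signed gradients keep a full step.
w, d = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, d = adarem_like_step(w, 2 * w, d)
```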
Related papers
- Adaptive Friction in Deep Learning: Enhancing Optimizers with Sigmoid and Tanh Function [0.0]
We introduce sigSignGrad and tanhSignGrad, two novel optimizers that integrate adaptive friction coefficients.
Our theoretical analysis demonstrates the wide-ranging adjustment capability of the friction coefficient S.
Experiments on CIFAR-10 and Mini-ImageNet using ResNet50 and ViT architectures confirm the superior performance of our proposed methods.
arXiv Detail & Related papers (2024-08-07T03:20:46Z)
- Asymmetric Momentum: A Rethinking of Gradient Descent [4.1001738811512345]
We propose the simplest SGD-enhanced method, Loss-Controlled Asymmetric Momentum (LCAM).
By averaging the loss, we divide the training process into different loss phases and use different momentum in each.
We experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset.
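The summary only states that the loss is averaged to split training into phases that use different momentum. A heavily hedged sketch of that idea, with hypothetical names and thresholds:

```python
def lcam_like_momentum(loss, mean_loss, step, beta_high=0.95, beta_low=0.85):
    """Pick a momentum coefficient from the current loss phase (illustrative
    sketch of the summary's idea; names and thresholds are assumptions)."""
    # Running average of the loss, used to split training into phases.
    mean_loss += (loss - mean_loss) / (step + 1)
    # Larger momentum while the loss is still above its running average
    # (early phase), smaller momentum once it drops below (late phase).
    beta = beta_high if loss > mean_loss else beta_low
    return beta, mean_loss
```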
arXiv Detail & Related papers (2023-09-05T11:16:47Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINNs can be trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models [132.90062129639705]
We propose a novel training strategy that encourages all parameters to be trained sufficiently.
A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate.
In contrast, a parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate to prevent further overfitting.
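The summary gives the rule (raise the learning rate of low-sensitivity parameters, lower it for high-sensitivity ones) but not the sensitivity measure. A minimal sketch under the assumption that |w * grad| serves as a sensitivity proxy; both the proxy and the scaling factors are illustrative, not the paper's definitions.

```python
import numpy as np

def sensitivity_scaled_step(w, grad, base_lr=1e-3, eps=1e-8):
    """Sensitivity-guided per-parameter learning rates (illustrative sketch)."""
    # Proxy for sensitivity: approximate loss change if a parameter were
    # removed, here |w * grad| -- an assumption, not the paper's definition.
    sensitivity = np.abs(w * grad)
    s = sensitivity / (sensitivity.max() + eps)    # normalize to [0, 1]
    # Low-sensitivity (redundant) parameters get up to 1.5x the base rate;
    # high-sensitivity (well-trained) ones get as little as 0.5x.
    lr = base_lr * (1.5 - s)
    return w - lr * grad
```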
arXiv Detail & Related papers (2022-02-06T00:22:28Z)
- Robust Learning via Persistency of Excitation [4.674053902991301]
We show that network training using gradient descent is equivalent to a dynamical system parameter estimation problem.
We provide an efficient technique for estimating the corresponding Lipschitz constant using extreme value theory.
Our approach also universally increases the adversarial accuracy by 0.1 to 0.3 percentage points in various state-of-the-art adversarially trained models.
arXiv Detail & Related papers (2021-06-03T18:49:05Z)
- Adam revisited: a weighted past gradients perspective [57.54752290924522]
We propose a novel weighted adaptive algorithm (WADA) to tackle the non-convergence issues of ADAM.
We prove that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD.
arXiv Detail & Related papers (2021-01-01T14:01:52Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Gradient Monitored Reinforcement Learning [0.0]
We focus on the enhancement of training and evaluation performance in reinforcement learning algorithms.
We propose an approach to steer the learning in the weight parameters of a neural network based on the dynamic development and feedback from the training process itself.
arXiv Detail & Related papers (2020-05-25T13:45:47Z)
- A Dynamic Sampling Adaptive-SGD Method for Machine Learning [8.173034693197351]
We propose a method that adaptively controls the batch size used in the computation of gradient approximations and the step size used to move along such directions.
The proposed method exploits local curvature information and ensures that search directions are descent directions with high probability.
Numerical experiments show that this method is able to choose the best learning rates and compares favorably to fine-tuned SGD for training logistic regression and DNNs.
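The summary does not spell out the batch-size test itself. The sketch below uses a generic norm-test heuristic only to illustrate dynamic sampling, i.e. growing the batch when the gradient estimate is too noisy; the threshold theta and the omission of the curvature-based step-size rule are simplifications, not the paper's criteria.

```python
import numpy as np

def needs_larger_batch(per_example_grads, theta=0.5):
    """Norm-test style check for growing the batch (generic illustration; the
    paper's own test also uses local curvature, which is omitted here).

    per_example_grads: array of shape (batch_size, num_params).
    """
    g_bar = per_example_grads.mean(axis=0)                 # mini-batch gradient
    sample_var = per_example_grads.var(axis=0, ddof=1).sum()
    n = per_example_grads.shape[0]
    # Grow the batch when the variance of the gradient estimate dominates its
    # squared norm, i.e. the direction is unlikely to be a descent direction.
    return sample_var / n > (theta ** 2) * float(np.dot(g_bar, g_bar))
```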
arXiv Detail & Related papers (2019-12-31T15:36:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.