Adam revisited: a weighted past gradients perspective
- URL: http://arxiv.org/abs/2101.00238v1
- Date: Fri, 1 Jan 2021 14:01:52 GMT
- Title: Adam revisited: a weighted past gradients perspective
- Authors: Hui Zhong, Zaiyi Chen, Chuan Qin, Zai Huang, Vincent W. Zheng, Tong
Xu, Enhong Chen
- Abstract summary: We propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issues.
We prove that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD.
- Score: 57.54752290924522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive learning rate methods have been successfully applied in many fields,
especially in training deep neural networks. Recent results have shown that
adaptive methods with exponentially increasing weights on squared past gradients
(i.e., ADAM, RMSPROP) may fail to converge to the optimal solution. Though many
algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix the
non-convergence issues, achieving a data-dependent regret bound similar to or
better than ADAGRAD is still a challenge for these methods. In this paper, we
propose a novel adaptive method, the weighted adaptive algorithm (WADA), to
tackle the non-convergence issues. Unlike AMSGRAD and ADAMNC, we consider a
milder weighting strategy on squared past gradients, in which the weights grow
linearly. Based on this idea, we propose the weighted adaptive gradient method
framework (WAGMF) and implement the WADA algorithm on this framework. Moreover, we
prove that WADA can achieve a weighted data-dependent regret bound, which could
be better than the original regret bound of ADAGRAD when the gradients decrease
rapidly. This bound may partially explain the good performance of ADAM in
practice. Finally, extensive experiments demonstrate the effectiveness of WADA
and its variants in comparison with several variants of ADAM on training convex
problems and deep neural networks.
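To make the contrast between weighting schemes concrete, here is a minimal NumPy sketch of the linearly growing weights on squared past gradients that the abstract describes (ADAGRAD corresponds to constant weights, ADAM/RMSPROP to exponentially increasing ones). The function name wada_like_update, the normalization by the total weight, and the toy objective are illustrative assumptions, not the paper's exact WADA/WAGMF update.
```python
import numpy as np

def wada_like_update(x, grad_fn, lr=0.1, eps=1e-8, steps=200):
    """Sketch of an adaptive step with linearly growing weights on squared
    past gradients: at step t, g_t^2 receives weight t, so
    v_t = sum_{i<=t} i * g_i^2. This only illustrates the weighting
    strategy; it is not the exact WADA update from the paper."""
    v = np.zeros_like(x)          # weighted sum of squared past gradients
    for t in range(1, steps + 1):
        g = grad_fn(x)
        v += t * g * g            # linear weight t (ADAGRAD would use weight 1)
        # Normalize by the total weight t(t+1)/2 so the denominator is a
        # weighted *average* of squared gradients, comparable across steps.
        denom = np.sqrt(v / (t * (t + 1) / 2.0)) + eps
        x = x - lr * g / denom
    return x

# Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x.
print(wada_like_update(np.array([3.0, -2.0]), lambda x: 2.0 * x))  # both coordinates driven toward 0
```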
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam and its variants have been central to training large models.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- AA-DLADMM: An Accelerated ADMM-based Framework for Training Deep Neural Networks [1.3812010983144802]
Stochastic gradient descent (SGD) and its many variants are the most widely used optimization algorithms for training deep neural networks.
However, SGD suffers from inevitable drawbacks, including vanishing gradients, lack of theoretical guarantees, and substantial sensitivity to input.
This paper proposes an Anderson Acceleration for Deep Learning ADMM (AA-DLADMM) algorithm to tackle these drawbacks.
arXiv Detail & Related papers (2024-01-08T01:22:00Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process (a toy sketch of a single implicit SGD step is given after this list).
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Divergence Results and Convergence of a Variance Reduced Version of ADAM [30.10316505009956]
We show that a variance-reduced ADAM-type algorithm converges, which suggests that it is the variance of the gradients that causes the divergence of the original ADAM.
Numerical experiments show that the proposed algorithm performs as well as ADAM.
arXiv Detail & Related papers (2022-10-11T16:54:56Z)
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that exploit the difference between the present and the past gradients.
We have tested ensembles of networks and their fusion with a ResNet50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of its current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
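As referenced in the implicit SGD entry above, the sketch below shows what a single implicit (proximal) SGD step looks like in the simplest setting. The names implicit_sgd_step and loss_fn, the use of scipy.optimize.minimize as the inner solver, and the toy quadratic are assumptions for illustration; this is not the PINN training procedure of that paper.
```python
import numpy as np
from scipy.optimize import minimize

def implicit_sgd_step(theta, loss_fn, lr=0.1):
    """One implicit SGD (proximal) step: return the minimizer of
        loss_fn(x) + ||x - theta||^2 / (2 * lr),
    whose optimality condition is x = theta - lr * grad(loss_fn)(x),
    i.e. the gradient is evaluated at the *new* point. Evaluating the
    gradient implicitly is what stabilizes the update on stiff problems."""
    prox_obj = lambda x: loss_fn(x) + np.sum((x - theta) ** 2) / (2.0 * lr)
    return minimize(prox_obj, theta).x

# Toy usage on a stiff quadratic f(x) = 50 * x^2 (gradient 100 * x):
# explicit SGD with lr = 0.1 diverges (multiplier 1 - 0.1 * 100 = -9),
# while each implicit step contracts x by a factor of 1/11.
theta = np.array([1.0])
for _ in range(10):
    theta = implicit_sgd_step(theta, lambda x: 50.0 * np.sum(x ** 2), lr=0.1)
print(theta)  # close to zero
```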
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences of its use.