MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients
- URL: http://arxiv.org/abs/2006.11918v4
- Date: Sun, 4 Jul 2021 19:33:57 GMT
- Title: MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients
- Authors: Chen Zhu, Yu Cheng, Zhe Gan, Furong Huang, Jingjing Liu, Tom Goldstein
- Abstract summary: We propose an adaptive learning rate principle, in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
- Score: 112.00379151834242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient methods such as RMSProp and Adam use an exponential moving
estimate of the squared gradient to compute adaptive step sizes, achieving
better convergence than SGD in the face of noisy objectives. However, Adam can have
undesirable convergence behaviors due to unstable or extreme adaptive learning
rates. Methods such as AMSGrad and AdaBound have been proposed to stabilize the
adaptive learning rates of Adam in the later stage of training, but they do not
outperform Adam in some practical tasks such as training Transformers
(Vaswani et al., 2017). In this paper, we propose an adaptive learning rate
principle, in which the running mean of squared gradient in Adam is replaced by
a weighted mean, with weights chosen to maximize the estimated variance of each
coordinate. This results in a faster adaptation to the local gradient variance,
which leads to more desirable empirical convergence behaviors than Adam. We
prove the proposed algorithm converges under mild assumptions for nonconvex
stochastic optimization problems, and demonstrate the improved efficacy of our
adaptive averaging approach on machine translation, natural language
understanding and large-batch pretraining of BERT. The code is available at
https://github.com/zhuchen03/MaxVA.
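To make the principle concrete, below is a minimal NumPy sketch of the idea: at each step, the mixing weight of the second-moment average is chosen per coordinate to maximize the estimated gradient variance v - m^2, and the resulting second moment is plugged into an Adam-style update. The function name maxva_step, the state dictionary, and the candidate grid rho_grid are illustrative assumptions, and the grid search is only a stand-in for the paper's actual maximization; the official implementation lives in the repository above.
```python
# Schematic sketch only -- not the authors' implementation (see the repository
# above). Assumes a flattened 1-D parameter vector and omits bias correction.
import numpy as np

def maxva_step(param, grad, state, lr=1e-3, beta1=0.9,
               rho_grid=(0.01, 0.05, 0.1, 0.2, 0.5), eps=1e-8):
    """Adam-like update where the second-moment averaging weight is chosen
    per coordinate to maximize the estimated gradient variance v - m**2."""
    m, v, exp_avg = state["m"], state["v"], state["exp_avg"]

    # Candidate running moments for each mixing weight rho: shape (len(rho_grid), dim).
    rhos = np.asarray(rho_grid)[:, None]
    m_cand = (1 - rhos) * m + rhos * grad          # weighted mean of gradients
    v_cand = (1 - rhos) * v + rhos * grad ** 2     # weighted mean of squared gradients
    var_cand = v_cand - m_cand ** 2                # per-coordinate variance estimate

    # Keep, per coordinate, the weight that maximizes the estimated variance.
    best = np.argmax(var_cand, axis=0)
    cols = np.arange(grad.shape[0])
    m, v = m_cand[best, cols], v_cand[best, cols]

    # Standard Adam-style step using the adaptively averaged second moment.
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    param = param - lr * exp_avg / (np.sqrt(v) + eps)

    state.update(m=m, v=v, exp_avg=exp_avg)
    return param, state
```
A caller would initialize state with zero vectors of the parameter's shape for "m", "v", and "exp_avg" and invoke maxva_step once per minibatch gradient. Because the averaging weight can jump within rho_grid, the second moment can track changes in the local gradient distribution faster than a fixed-beta exponential moving average, which is the behavior the abstract describes.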
Related papers
- StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling [0.0]
We introduce StochGradAdam, a novel extension of the Adam algorithm, incorporating gradient sampling techniques.
StochGradAdam achieves comparable or superior performance to Adam, even when using fewer gradient updates per iteration.
The results suggest that this approach is particularly effective for large-scale models and datasets.
arXiv Detail & Related papers (2023-10-25T22:45:31Z) - Adaptive Gradient Methods at the Edge of Stability [23.246757545508444]
We shed light on the training dynamics of adaptive gradient methods like Adam in deep learning.
Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
arXiv Detail & Related papers (2022-07-29T05:23:47Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - On the SDEs and Scaling Rules for Adaptive Gradient Algorithms [45.007261870784475]
Approximating Stochastic Gradient Descent (SGD) by a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory.
This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of correctness as well as experimental validation of their applicability.
arXiv Detail & Related papers (2022-05-20T16:39:03Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - CADA: Communication-Adaptive Distributed Adam [31.02472517086767]
Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning.
This paper proposes an adaptive gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method.
arXiv Detail & Related papers (2020-12-31T06:52:18Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster in finite time, up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex min-max problems.
Our experiments empirically compare adaptive and non-adaptive gradient algorithms for GAN training.
arXiv Detail & Related papers (2019-12-26T22:10:10Z) - On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization [80.03647903934723]
We prove convergence rates for adaptive gradient methods on nonconvex stochastic optimization problems, measured by the expected gradient norm.
Our analyses shed light on a better understanding of when adaptive gradient methods help in nonconvex optimization.
arXiv Detail & Related papers (2018-08-16T20:25:28Z)