Adam$^+$: A Stochastic Method with Adaptive Variance Reduction
- URL: http://arxiv.org/abs/2011.11985v1
- Date: Tue, 24 Nov 2020 09:28:53 GMT
- Title: Adam$^+$: A Stochastic Method with Adaptive Variance Reduction
- Authors: Mingrui Liu, Wei Zhang, Francesco Orabona, Tianbao Yang
- Abstract summary: Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
- Score: 56.051001950733315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adam is a widely used stochastic optimization method for deep learning
applications. While practitioners prefer Adam because it requires less
parameter tuning, its use is problematic from a theoretical point of view since
it may not converge. Variants of Adam have been proposed with provable
convergence guarantees, but they tend not to be competitive with Adam in practical
performance. In this paper, we propose a new method named Adam$^+$
(pronounced as Adam-plus). Adam$^+$ retains some of the key components of Adam
but it also has several noticeable differences: (i) it does not maintain a
moving average of the second moment estimate but instead computes a moving
average of the first moment estimate at extrapolated points; (ii) its adaptive step
size is formed not by dividing by the square root of the second moment estimate but
instead by dividing by the root of the norm of the first moment estimate. As a result,
Adam$^+$ requires little parameter tuning, like Adam, but it enjoys a provable
convergence guarantee. Our analysis further shows that Adam$^+$ enjoys adaptive
variance reduction, i.e., the variance of the stochastic gradient estimator
reduces as the algorithm converges, hence achieving adaptive convergence. We
also propose a more general variant of Adam$^+$ with different adaptive step
sizes and establish its fast convergence rate. Our empirical studies on
various deep learning tasks, including image classification, language modeling,
and automatic speech recognition, demonstrate that Adam$^+$ significantly
outperforms Adam and achieves comparable performance with best-tuned SGD and
momentum SGD.
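To make the two differences concrete, here is a minimal Python sketch of the update structure the abstract describes. It is not the paper's Algorithm 1: the extrapolation rule, the use of the square root of $\|m_t\|$ in the step size, and all constants here are assumptions chosen only to illustrate points (i) and (ii).
```python
import numpy as np

def adam_plus_sketch(grad_fn, x0, lr=0.1, beta=0.1, eps=1e-8, steps=100):
    """Illustrative sketch of the Adam$^+$ update structure (not the paper's
    exact Algorithm 1).  grad_fn(x) returns a stochastic gradient at x."""
    x_prev = x0.copy()
    x = x0.copy()
    m = grad_fn(x0)                                   # first-moment estimate
    for _ in range(steps):
        # (i) moving average of the first moment, evaluated at an
        #     extrapolated point (the extrapolation rule here is an assumption).
        y = x + ((1.0 - beta) / beta) * (x - x_prev)
        m = (1.0 - beta) * m + beta * grad_fn(y)
        # (ii) adaptive step: divide by the root of the norm of the
        #      first-moment estimate, not by a second-moment estimate.
        x_prev, x = x, x - lr * m / (np.sqrt(np.linalg.norm(m)) + eps)
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2
rng = np.random.default_rng(0)
x_final = adam_plus_sketch(lambda x: x + 0.01 * rng.standard_normal(x.shape),
                           x0=np.ones(10))
```
Note that, unlike Adam, the sketch keeps no per-coordinate second-moment buffer; the step size adapts through the norm of the first-moment estimate alone.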
Related papers
- On Convergence of Adam for Stochastic Optimization under Relaxed
Assumptions [4.9495085874952895]
The Adaptive Momentum Estimation (Adam) algorithm is highly effective in various deep learning tasks.
We show that, under this general noise model, Adam finds a stationary point with high probability at a provable rate.
arXiv Detail & Related papers (2024-02-06T13:19:26Z) - UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic
Optimization [20.399244578926474]
We introduce a unified framework for Adam-type algorithms, called UAdam.
It is equipped with a general form of the second-order moment, covering variants such as NAdam, AdaBound, AdaFom, and Adan.
We show that UAdam converges to a neighborhood of stationary points at a rate of $\mathcal{O}(1/T)$.
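As a rough illustration of what a "general form of the second-order moment" can look like, the sketch below parameterizes an Adam-type step by a pluggable second-moment rule `psi`; the specific `psi` examples and constants are assumptions for illustration and are not taken from the UAdam paper.
```python
import numpy as np

def generic_adam_step(x, g, m, v, psi, lr=1e-3, beta1=0.9, eps=1e-8):
    """One Adam-type step with a pluggable second-moment rule `psi`
    (a sketch of the unified-framework idea, not UAdam's exact recursion)."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment, as in Adam
    v = psi(v, g)                            # generalized second moment
    return x - lr * m / (np.sqrt(v) + eps), m, v

# Two illustrative second-moment rules (assumptions, for illustration only):
ema_psi      = lambda v, g, b2=0.999: b2 * v + (1 - b2) * g**2                  # Adam-style EMA
monotone_psi = lambda v, g, b2=0.999: np.maximum(v, b2 * v + (1 - b2) * g**2)   # AMSGrad-style
```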
arXiv Detail & Related papers (2023-05-09T13:07:03Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with a diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam with an $\ell_2$ regularizer is often denoted Adam-$\ell_2$.
AdamW decouples the weight decay from the gradient-based update rule of Adam-$\ell_2$.
We show that AdamW's advantage over Adam-$\ell_2$ correlates with the degree to which we expect the gradients of the network to exhibit multiple scales.
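The decoupling is easiest to see by comparing where the weight decay enters each update. The sketch below follows the commonly used Adam-$\ell_2$ and AdamW recursions (bias correction omitted for brevity) and is intended only to illustrate the distinction discussed above.
```python
import numpy as np

def adam_l2_step(x, g, m, v, wd, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with an l2 regularizer: the decay term wd * x is folded into the
    gradient, so it gets rescaled by the adaptive preconditioner."""
    g = g + wd * x
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return x - lr * m / (np.sqrt(v) + eps), m, v

def adamw_step(x, g, m, v, wd, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: the weight decay is decoupled and applied directly to the
    parameters, bypassing the adaptive step size."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return x - lr * (m / (np.sqrt(v) + eps) + wd * x), m, v
```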
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish convergence for solving a broader family of non-convex optimization problems, such as compositional problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - EAdam Optimizer: How $\epsilon$ Impact Adam [7.0552555621312605]
We discuss the impact of the constant $\epsilon$ on Adam in this paper.
Based on this finding, we propose a new variant of Adam called EAdam.
Our method brings significant improvements compared with Adam.
arXiv Detail & Related papers (2020-11-04T06:39:44Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
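One toy reading of this principle: per coordinate, pick the mixing weight for the squared-gradient average that maximizes the implied variance estimate (mean of squares minus squared mean). The candidate grid and the variance formula in the sketch below are assumptions for illustration; the paper's actual weight selection rule is not reproduced here.
```python
import numpy as np

def maxva_style_v(m, v, g, candidates=(0.5, 0.9, 0.99, 0.999)):
    """Sketch of the stated principle: replace Adam's fixed running mean of
    g**2 with, per coordinate, the candidate mixing weight that maximizes
    the estimated gradient variance v - m**2.  The candidate grid is an
    assumption, not the paper's rule."""
    best_v = v.copy()
    best_var = np.full_like(v, -np.inf)
    for beta in candidates:
        v_trial = beta * v + (1.0 - beta) * g**2
        var_trial = v_trial - m**2                 # per-coordinate variance estimate
        pick = var_trial > best_var
        best_v = np.where(pick, v_trial, best_v)
        best_var = np.where(pick, var_trial, best_var)
    return best_v
```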
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence for the Adam and Adagrad algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
With appropriately chosen hyperparameters, Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate as Adagrad.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.