Adam$^+$: A Stochastic Method with Adaptive Variance Reduction
- URL: http://arxiv.org/abs/2011.11985v1
- Date: Tue, 24 Nov 2020 09:28:53 GMT
- Title: Adam$^+$: A Stochastic Method with Adaptive Variance Reduction
- Authors: Mingrui Liu, Wei Zhang, Francesco Orabona, Tianbao Yang
- Abstract summary: Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
- Score: 56.051001950733315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adam is a widely used stochastic optimization method for deep learning
applications. While practitioners prefer Adam because it requires less
parameter tuning, its use is problematic from a theoretical point of view since
it may not converge. Variants of Adam have been proposed with provable
convergence guarantees, but they tend not to be competitive with Adam in practical
performance. In this paper, we propose a new method named Adam$^+$
(pronounced as Adam-plus). Adam$^+$ retains some of the key components of Adam
but it also has several noticeable differences: (i) it does not maintain a
moving average of the second moment estimate but instead computes a moving
average of the first moment estimate at extrapolated points; (ii) its adaptive step
size is formed not by dividing by the square root of the second moment estimate but
instead by dividing by the root of the norm of the first moment estimate. As a result,
Adam$^+$ requires little parameter tuning, like Adam, but it enjoys a provable
convergence guarantee. Our analysis further shows that Adam$^+$ enjoys adaptive
variance reduction, i.e., the variance of the stochastic gradient estimator
reduces as the algorithm converges, hence achieving adaptive convergence. We
also propose a more general variant of Adam$^+$ with different adaptive step
sizes and establish its fast convergence rate. Our empirical studies on
various deep learning tasks, including image classification, language modeling,
and automatic speech recognition, demonstrate that Adam$^+$ significantly
outperforms Adam and achieves comparable performance with best-tuned SGD and
momentum SGD.
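To make the two differences concrete, here is a minimal Python sketch of the update structure the abstract describes. It is not the paper's Algorithm 1: the extrapolation rule, the use of the square root of $\|m_t\|$ in the step size, and all constants here are assumptions chosen only to illustrate points (i) and (ii).
```python
import numpy as np

def adam_plus_sketch(grad_fn, x0, lr=0.1, beta=0.1, eps=1e-8, steps=100):
    """Illustrative sketch of the Adam$^+$ update structure (not the paper's
    exact Algorithm 1).  grad_fn(x) returns a stochastic gradient at x."""
    x_prev = x0.copy()
    x = x0.copy()
    m = grad_fn(x0)                                   # first-moment estimate
    for _ in range(steps):
        # (i) moving average of the first moment, evaluated at an
        #     extrapolated point (the extrapolation rule here is an assumption).
        y = x + ((1.0 - beta) / beta) * (x - x_prev)
        m = (1.0 - beta) * m + beta * grad_fn(y)
        # (ii) adaptive step: divide by the root of the norm of the
        #      first-moment estimate, not by a second-moment estimate.
        x_prev, x = x, x - lr * m / (np.sqrt(np.linalg.norm(m)) + eps)
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2
rng = np.random.default_rng(0)
x_final = adam_plus_sketch(lambda x: x + 0.01 * rng.standard_normal(x.shape),
                           x0=np.ones(10))
```
Note that, unlike Adam, the sketch keeps no per-coordinate second-moment buffer; the step size adapts through the norm of the first-moment estimate alone.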
Related papers
- On Convergence of Adam for Stochastic Optimization under Relaxed
Assumptions [4.9495085874952895]
The Adaptive Momentum Estimation (Adam) algorithm is highly effective in various deep learning tasks.
We show that, under this general noise model, Adam finds a stationary point with high probability at a provable rate.
arXiv Detail & Related papers (2024-02-06T13:19:26Z) - UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic
Optimization [20.399244578926474]
We introduce a unified framework for Adam-type algorithms, called UAdam.
It is equipped with a general form of the second-order moment, covering variants such as NAdam, AdaBound, AdaFom, and Adan.
We show that UAdam converges to a neighborhood of stationary points at a rate of $\mathcal{O}(1/T)$.
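As a rough illustration of what a "general form of the second-order moment" can look like, the sketch below parameterizes an Adam-type step by a pluggable second-moment rule `psi`; the specific `psi` examples and constants are assumptions for illustration and are not taken from the UAdam paper.
```python
import numpy as np

def generic_adam_step(x, g, m, v, psi, lr=1e-3, beta1=0.9, eps=1e-8):
    """One Adam-type step with a pluggable second-moment rule `psi`
    (a sketch of the unified-framework idea, not UAdam's exact recursion)."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment, as in Adam
    v = psi(v, g)                            # generalized second moment
    return x - lr * m / (np.sqrt(v) + eps), m, v

# Two illustrative second-moment rules (assumptions, for illustration only):
ema_psi      = lambda v, g, b2=0.999: b2 * v + (1 - b2) * g**2                  # Adam-style EMA
monotone_psi = lambda v, g, b2=0.999: np.maximum(v, b2 * v + (1 - b2) * g**2)   # AMSGrad-style
```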
arXiv Detail & Related papers (2023-05-09T13:07:03Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with a diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam with an $\ell_2$ regularizer is often denoted Adam-$\ell_2$.
AdamW decouples the weight decay from the gradient-based update rule of Adam-$\ell_2$.
We show that AdamW's advantage over Adam-$\ell_2$ correlates with the degree to which we expect the gradients of the network to exhibit multiple scales.
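The decoupling is easiest to see by comparing where the weight decay enters each update. The sketch below follows the commonly used Adam-$\ell_2$ and AdamW recursions (bias correction omitted for brevity) and is intended only to illustrate the distinction discussed above.
```python
import numpy as np

def adam_l2_step(x, g, m, v, wd, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with an l2 regularizer: the decay term wd * x is folded into the
    gradient, so it gets rescaled by the adaptive preconditioner."""
    g = g + wd * x
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return x - lr * m / (np.sqrt(v) + eps), m, v

def adamw_step(x, g, m, v, wd, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: the weight decay is decoupled and applied directly to the
    parameters, bypassing the adaptive step size."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return x - lr * (m / (np.sqrt(v) + eps) + wd * x), m, v
```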
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish convergence for solving a broader family of non-convex optimization problems, such as compositional problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - EAdam Optimizer: How $\epsilon$ Impact Adam [7.0552555621312605]
We discuss the impact of the constant $\epsilon$ on Adam in this paper.
Based on this finding, we propose a new variant of Adam called EAdam.
Our method brings significant improvements compared with Adam.
arXiv Detail & Related papers (2020-11-04T06:39:44Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
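One toy reading of this principle: per coordinate, pick the mixing weight for the squared-gradient average that maximizes the implied variance estimate (mean of squares minus squared mean). The candidate grid and the variance formula in the sketch below are assumptions for illustration; the paper's actual weight selection rule is not reproduced here.
```python
import numpy as np

def maxva_style_v(m, v, g, candidates=(0.5, 0.9, 0.99, 0.999)):
    """Sketch of the stated principle: replace Adam's fixed running mean of
    g**2 with, per coordinate, the candidate mixing weight that maximizes
    the estimated gradient variance v - m**2.  The candidate grid is an
    assumption, not the paper's rule."""
    best_v = v.copy()
    best_var = np.full_like(v, -np.inf)
    for beta in candidates:
        v_trial = beta * v + (1.0 - beta) * g**2
        var_trial = v_trial - m**2                 # per-coordinate variance estimate
        pick = var_trial > best_var
        best_v = np.where(pick, v_trial, best_v)
        best_var = np.where(pick, var_trial, best_var)
    return best_v
```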
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence for the Adam and Adagrad algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
With appropriately chosen hyperparameters, Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate as Adagrad.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.