On Convergence of Adam for Stochastic Optimization under Relaxed
Assumptions
- URL: http://arxiv.org/abs/2402.03982v1
- Date: Tue, 6 Feb 2024 13:19:26 GMT
- Title: On Convergence of Adam for Stochastic Optimization under Relaxed
Assumptions
- Authors: Yusu Hong and Junhong Lin
- Abstract summary: The Adaptive Momentum Estimation (Adam) algorithm is highly effective in various deep learning tasks.
We show that Adam can find a stationary point with a $\mathcal{O}(\text{poly}(\log T)/\sqrt{T})$ rate in high probability under this general noise model.
- Score: 4.9495085874952895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Adaptive Momentum Estimation (Adam) algorithm is highly effective in
training various deep learning tasks. Despite this, there's limited theoretical
understanding for Adam, especially when focusing on its vanilla form in
non-convex smooth scenarios with potential unbounded gradients and affine
variance noise. In this paper, we study vanilla Adam under these challenging
conditions. We introduce a comprehensive noise model which governs affine
variance noise, bounded noise and sub-Gaussian noise. We show that Adam can
find a stationary point with a $\mathcal{O}(\text{poly}(\log T)/\sqrt{T})$ rate
in high probability under this general noise model, where $T$ denotes the total
number of iterations, matching the lower-bound rate for stochastic first-order
algorithms up to logarithmic factors. More importantly, we reveal that Adam
requires no tuning of its step-sizes with any problem parameters, yielding a
better adaptation property than Stochastic Gradient Descent under the same
conditions. We
also provide a probabilistic convergence result for Adam under a generalized
smooth condition which allows unbounded smoothness parameters and has been
illustrated empirically to more accurately capture the smooth property of many
practical objective functions.
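For concreteness, here is a minimal sketch of the vanilla Adam iteration discussed above, with the affine variance noise condition recorded in a comment. The hyperparameter defaults, the stochastic_grad oracle, and the omission of bias correction are illustrative assumptions, not the exact setting analyzed in the paper.

    # Minimal sketch of vanilla Adam (assumptions: no bias correction,
    # default hyperparameters; not the paper's exact formulation).
    import numpy as np

    def adam(x0, stochastic_grad, T, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """Run T vanilla Adam steps from x0 using a stochastic gradient oracle.

        The abstract's general noise model roughly requires an affine variance
        condition such as E[||g_t - grad f(x_t)||^2] <= A + B*||grad f(x_t)||^2,
        which covers bounded and sub-Gaussian noise as special cases.
        """
        x = np.asarray(x0, dtype=float)
        m = np.zeros_like(x)  # exponential moving average of gradients
        v = np.zeros_like(x)  # exponential moving average of squared gradients
        for _ in range(T):
            g = stochastic_grad(x)                 # noisy gradient sample
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            x = x - eta * m / (np.sqrt(v) + eps)   # coordinate-wise adaptive step
        return x

    # Example usage: a noisy quadratic f(x) = 0.5*||x||^2 with Gaussian gradient noise.
    rng = np.random.default_rng(0)
    print(adam(np.ones(5), lambda x: x + 0.1 * rng.standard_normal(x.shape), T=2000))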
Related papers
- Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed [83.8485684139678]
Methods with adaptive step-sizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models.
We show that AdaGrad can have poor high-probability convergence if the noise is heavy-tailed.
We propose a new version of AdaGrad called Clip-RAdaGradD (Clipped Reweighted AdaGrad with Delay).
arXiv Detail & Related papers (2024-06-06T18:49:10Z) - Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance [23.112775335244258]
We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum (see the sketch after this list).
We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm.
Our results for both RMSProp and Adam match the complexity lower bound established in Arjevani et al. (2023).
arXiv Detail & Related papers (2024-04-01T19:17:45Z) - High Probability Convergence of Adam Under Unbounded Gradients and
Affine Variance Noise [4.9495085874952895]
We show that Adam could converge to a stationary point in high probability with a rate of $\mathcal{O}\left(\text{poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" noise variance.
It is also revealed that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left(\text{poly}(\log T)\right)$, adapting to the noise level.
arXiv Detail & Related papers (2023-11-03T15:55:53Z) - UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic
Optimization [20.399244578926474]
We introduce a unified framework for Adam-type algorithms, called UAdam.
It is equipped with a general form of the second-order moment, covering variants such as NAdamBound, AdaFom, and Adan (see the sketch after this list).
We show that UAdam converges to a neighborhood of stationary points at a rate of $\mathcal{O}(1/T)$.
arXiv Detail & Related papers (2023-05-09T13:07:03Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from this assumption can actually lead to better statistical estimators.
In particular, the optimal noise distribution is different from the data distribution and may even come from a different family.
arXiv Detail & Related papers (2022-03-02T13:59:20Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non-convex optimization problems, including compositional problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
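As referenced from the RMSProp/Adam and UAdam entries above, the following minimal sketch shows an Adam-type step in which the second-order moment rule is pluggable; setting beta1 = 0 recovers RMSProp. The function names and the exponential-moving-average choice are illustrative assumptions, not the formulations of the cited papers.

    # Generic Adam-type step with a pluggable second-moment rule (illustrative sketch).
    import numpy as np

    def ema_second_moment(v, g, beta2=0.999):
        # Adam/RMSProp-style exponential moving average of squared gradients.
        return beta2 * v + (1.0 - beta2) * g * g

    def adam_type_step(x, m, v, g, second_moment=ema_second_moment,
                       eta=1e-3, beta1=0.9, eps=1e-8):
        # beta1 = 0.9 gives (vanilla) Adam; beta1 = 0.0 recovers RMSProp;
        # swapping second_moment yields other members of the family.
        m = beta1 * m + (1.0 - beta1) * g
        v = second_moment(v, g)
        x = x - eta * m / (np.sqrt(v) + eps)
        return x, m, v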