A Novel Convergence Analysis for Algorithms of the Adam Family
- URL: http://arxiv.org/abs/2112.03459v1
- Date: Tue, 7 Dec 2021 02:47:58 GMT
- Title: A Novel Convergence Analysis for Algorithms of the Adam Family
- Authors: Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, Tianbao Yang
- Abstract summary: We present a generic proof of convergence for a family of Adam-style methods, including Adam, AMSGrad, AdaBound, etc.
Our analysis is so simple and generic that it can be leveraged to establish convergence for a broader family of non-convex optimization problems, including min-max, compositional, and bilevel optimization problems.
- Score: 105.22760323075008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since its invention in 2014, the Adam optimizer has received tremendous
attention. On one hand, it has been widely used in deep learning and many
variants have been proposed, while on the other hand their theoretical
convergence properties remain a mystery. The existing theory is far from
satisfactory: some studies require strong assumptions about the updates, which
are not necessarily applicable in practice, while other studies still follow
the original, problematic convergence analysis of Adam, which was shown to be
insufficient to ensure convergence. Although rigorous convergence analyses
exist for Adam, they impose specific requirements on the update of the
adaptive step size that are not generic enough to cover many other variants
of Adam. To address these issues, in this extended abstract, we present a
simple and generic proof of convergence for a family of Adam-style methods
(including Adam, AMSGrad, AdaBound, etc.). Our analysis only requires an
increasing or large "momentum" parameter for the first-order moment, which is
indeed the case used in practice, and a boundedness condition on the adaptive
factor of the step size, which applies to all variants of Adam under mild
conditions on the stochastic gradients. We also establish a variance-diminishing
result for the stochastic gradient estimators used. Indeed, our analysis of
Adam is so simple and generic that it can be leveraged to establish the
convergence for solving a broader family of non-convex optimization problems,
including min-max, compositional, and bilevel optimization problems. For the
full (earlier) version of this extended abstract, please refer to
arXiv:2104.14840.
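To make the two conditions above concrete, here is a minimal, hypothetical Python sketch of one update of an Adam-style method, x_{t+1} = x_t - s_t * m_t: the first-order moment m_t uses a large momentum parameter (beta1), and each variant keeps the adaptive step-size factor s_t bounded in its own way. The variant rules are simplified (e.g., AdaBound's time-varying bounds are replaced by fixed constants), and this is not the authors' implementation.

```python
import numpy as np

def adam_family_step(x, m, v, grad, variant="adam", lr=1e-2,
                     beta1=0.9, beta2=0.999, eps=1e-8,
                     v_max=None, lower=1e-4, upper=1e-1):
    """One generic Adam-style update x <- x - s * m (bias correction omitted).

    m is the first-order moment, driven by a large momentum parameter beta1
    as used in practice; s is the adaptive step-size factor, which each
    variant keeps bounded in its own way (the boundedness condition above).
    """
    m = beta1 * m + (1.0 - beta1) * grad        # first-order moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-order moment estimate

    if variant == "adam":
        s = lr / (np.sqrt(v) + eps)             # bounded when gradients are bounded
    elif variant == "amsgrad":
        v_max = v if v_max is None else np.maximum(v_max, v)
        s = lr / (np.sqrt(v_max) + eps)         # monotone v_max => non-increasing s
    elif variant == "adabound":
        s = np.clip(lr / (np.sqrt(v) + eps), lower, upper)  # explicit two-sided bound
    else:
        raise ValueError(f"unknown variant: {variant}")

    return x - s * m, m, v, v_max


# Toy usage on noisy gradients of f(x) = 0.5 * ||x||^2 (illustrative only).
rng = np.random.default_rng(0)
x, m, v, v_max = np.ones(5), np.zeros(5), np.zeros(5), None
for _ in range(2000):
    grad = x + 0.1 * rng.standard_normal(5)     # stochastic gradient
    x, m, v, v_max = adam_family_step(x, m, v, grad, variant="amsgrad", v_max=v_max)
print(np.linalg.norm(x))                        # ends near the noise floor
```

The point of the sketch is only that all three variants share the same update template and differ in how the adaptive factor is constrained, which is what the boundedness condition abstracts.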
Related papers
- A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD [28.905886549938305]
We introduce a novel and comprehensive framework for analyzing the convergence properties of Adam.
We show that Adam attains non-asymptotic sample complexity bounds similar to those of SGD.
arXiv Detail & Related papers (2024-10-06T12:15:00Z)
- High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise [4.9495085874952895]
We show that Adam can converge to a stationary point with high probability at a rate of $\mathcal{O}\left(\mathrm{poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" noise variance.
It is also revealed that Adam's convergence within an order of $\mathcal{O}\left(\mathrm{poly}(\log T)\right)$ is adaptive to the noise level.
arXiv Detail & Related papers (2023-11-03T15:55:53Z)
- Closing the Gap Between the Upper Bound and the Lower Bound of Adam's Iteration Complexity [51.96093077151991]
We derive a new convergence guarantee of Adam, with only an $L$-smooth condition and a bounded noise variance assumption.
Our proof utilizes novel techniques to handle the entanglement between momentum and adaptive learning rate.
arXiv Detail & Related papers (2023-10-27T09:16:58Z)
- UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization [20.399244578926474]
We introduce a unified framework for Adam-type algorithms, called UAdam.
It is equipped with a general form of the second-order moment that covers variants such as NAdam, AdaBound, AdaFom, and Adan (a minimal sketch of this pluggable template appears after this list).
We show that UAdam converges to a neighborhood of stationary points at a rate of $\mathcal{O}(1/T)$.
arXiv Detail & Related papers (2023-05-09T13:07:03Z)
- Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z)
- Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with a diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been used to help Adam-type algorithms converge.
We introduce an alternative, easy-to-check sufficient condition that depends only on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z)
- Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)
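As referenced in the UAdam entry above, the following hypothetical sketch illustrates the general idea of a unified Adam-type template in which only the second-order-moment rule is swapped per variant; the function names and the two example rules are assumptions for illustration, not code from the UAdam paper.

```python
import numpy as np

# Two example second-moment rules that can be plugged into the same loop
# (illustrative choices, not taken from the UAdam paper).
def adam_moment(v, grad, beta2=0.999):
    return beta2 * v + (1.0 - beta2) * grad ** 2

def amsgrad_moment(v, grad, beta2=0.999):
    return np.maximum(v, beta2 * v + (1.0 - beta2) * grad ** 2)

def unified_adam_step(x, m, v, grad, moment_rule, lr=1e-3, beta1=0.9, eps=1e-8):
    """Unified Adam-type update: the loop is fixed, only moment_rule varies."""
    m = beta1 * m + (1.0 - beta1) * grad      # shared first-order moment
    v = moment_rule(v, grad)                  # variant-specific second-order moment
    return x - lr * m / (np.sqrt(v) + eps), m, v
```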