Convergence of Adam Under Relaxed Assumptions
- URL: http://arxiv.org/abs/2304.13972v3
- Date: Tue, 7 Nov 2023 03:12:49 GMT
- Title: Convergence of Adam Under Relaxed Assumptions
- Authors: Haochuan Li, Alexander Rakhlin, Ali Jadbabaie
- Abstract summary: We show that Adam converges to $epsilon$-stationary points with $O(epsilon-4)$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(epsilon-3)$.
- Score: 72.24779199744954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we provide a rigorous proof of convergence of the Adaptive
Moment Estimate (Adam) algorithm for a wide class of optimization objectives.
Despite the popularity and efficiency of the Adam algorithm in training deep
neural networks, its theoretical properties are not yet fully understood, and
existing convergence proofs require unrealistically strong assumptions, such as
globally bounded gradients, to show the convergence to stationary points. In
this paper, we show that Adam provably converges to $\epsilon$-stationary
points with ${O}(\epsilon^{-4})$ gradient complexity under far more realistic
conditions. The key to our analysis is a new proof of boundedness of gradients
along the optimization trajectory of Adam, under a generalized smoothness
assumption according to which the local smoothness (i.e., Hessian norm when it
exists) is bounded by a sub-quadratic function of the gradient norm. Moreover,
we propose a variance-reduced version of Adam with an accelerated gradient
complexity of ${O}(\epsilon^{-3})$.
Related papers
- A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD [28.905886549938305]
We introduce a novel and comprehensive framework for analyzing the convergence properties of Adam.
We show that Adam attains non-asymptotic complexity sample bounds similar to those of gradient descent.
arXiv Detail & Related papers (2024-10-06T12:15:00Z) - Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance [23.112775335244258]
We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum.
We develop a new upper bound first-order term in the descent lemma, which is also a function of the gradient norm.
Our results for both RMSProp and Adam match with the complexity established in citearvani2023lower.
arXiv Detail & Related papers (2024-04-01T19:17:45Z) - High Probability Convergence of Adam Under Unbounded Gradients and
Affine Variance Noise [4.9495085874952895]
We show that Adam could converge to the stationary point in high probability with a rate of $mathcalOleft(rm poly(log T)/sqrtTright)$ under coordinate-wise "affine" noise variance.
It is also revealed that Adam's confines within an order of $mathcalOleft(rm poly(left T)right)$ are adaptive to the noise level.
arXiv Detail & Related papers (2023-11-03T15:55:53Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Randomized Coordinate Subgradient Method for Nonsmooth Composite
Optimization [11.017632675093628]
Coordinate-type subgradient methods for addressing nonsmooth problems are relatively underexplored due to the set of properties of the Lipschitz-type assumption.
arXiv Detail & Related papers (2022-06-30T02:17:11Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non- compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We show a proof of convergence between the Adam Adagrad and $O(d(N)/st)$ algorithms.
Adam converges with the same convergence $O(d(N)/st)$ when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.