Related papers: Convergence of Adam Under Relaxed Assumptions

URL: http://arxiv.org/abs/2304.13972v3
Date: Tue, 7 Nov 2023 03:12:49 GMT
Title: Convergence of Adam Under Relaxed Assumptions
Authors: Haochuan Li, Alexander Rakhlin, Ali Jadbabaie
Abstract summary: We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions. We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
Score: 72.24779199744954
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with ${O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of ${O}(\epsilon^{-3})$.
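
To make the abstract's terms concrete, here is a minimal Python sketch of the standard Adam update run on a toy objective. It is not the authors' code or experimental setup: the objective f(x) = sum(x_i^4)/4, the starting point, the step size, and the helper names `adam_minimize`, `grad_quartic`, and `grad_norm` are all assumptions chosen for illustration. The toy objective has no globally bounded Hessian, but its second derivative 3x^2 equals 3|f'(x)|^(2/3), so it grows sub-quadratically in the gradient norm, which is the kind of condition the abstract describes.

```python
import math


def adam_minimize(grad, x0, steps=5000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run the textbook Adam update with bias correction; return the final iterate."""
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (momentum) estimate
    v = [0.0] * len(x)  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def grad_quartic(x):
    """Gradient of the illustrative objective f(x) = sum(x_i**4) / 4."""
    return [xi ** 3 for xi in x]


def grad_norm(g):
    """Euclidean norm of a gradient vector."""
    return math.sqrt(sum(gi * gi for gi in g))


# Hypothetical starting point and tolerance, chosen only for this sketch.
x0 = [2.0, -1.5]
x_final = adam_minimize(grad_quartic, x0)
initial_norm = grad_norm(grad_quartic(x0))
final_norm = grad_norm(grad_quartic(x_final))
is_stationary = final_norm <= 0.1  # epsilon-stationarity check with epsilon = 0.1
```

An iterate counts as $\epsilon$-stationary when its gradient norm is at most $\epsilon$, which is what `is_stationary` checks. The sketch uses exact gradients, whereas the paper's complexity bounds concern stochastic gradients, so it illustrates the update rule and the stationarity criterion rather than the stated rates.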

Related papers

Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $\mathcal{O}(1/T^{1/4})$ rather than the previous $\mathcal{O}(\ln T/T^{1/4})$. Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
arXiv Detail & Related papers (2025-07-08T13:19:26Z)
Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness [51.302674884611335]
This work introduces a hybrid non-Euclidean optimization method which generalizes norm clipping by combining steepest descent and conditional gradient approaches. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
arXiv Detail & Related papers (2025-06-02T17:34:29Z)
A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD [28.905886549938305]
We introduce a novel and comprehensive framework for analyzing the convergence properties of Adam. We show that Adam attains non-asymptotic sample complexity bounds similar to those of gradient descent.
arXiv Detail & Related papers (2024-10-06T12:15:00Z)
Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance [23.112775335244258]
We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum. We develop a new upper bound for the first-order term in the descent lemma, which is also a function of the gradient norm. Our results for both RMSProp and Adam match the complexity established in \cite{arvani2023lower}.
arXiv Detail & Related papers (2024-04-01T19:17:45Z)
High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise [4.9495085874952895]
We show that Adam could converge to the stationary point in high probability with a rate of $\mathcal{O}\left(\mathrm{poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" noise variance. It is also revealed that Adam's confines within an order of $\mathcal{O}\left(\mathrm{poly}(\log T)\right)$ are adaptive to the noise level.
arXiv Detail & Related papers (2023-11-03T15:55:53Z)
Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
Randomized Coordinate Subgradient Method for Nonsmooth Composite Optimization [11.017632675093628]
Coordinate-type subgradient methods for addressing nonsmooth problems are relatively underexplored due to the set of properties of the Lipschitz-type assumption.
arXiv Detail & Related papers (2022-06-30T02:17:11Z)
A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc. Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non- compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z)
Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave. At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z)
A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a proof of convergence for the Adam and Adagrad algorithms with a rate of $O(d\ln(N)/\sqrt{N})$. Adam converges with the same rate $O(d\ln(N)/\sqrt{N})$ when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks. In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems. Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.