Related papers: A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD

URL: http://arxiv.org/abs/2410.04458v2
Date: Sat, 19 Oct 2024 09:33:12 GMT
Title: A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD
Authors: Ruinan Jin, Xiao Li, Yaoliang Yu, Baoxiang Wang
Abstract summary: We introduce a novel and comprehensive framework for analyzing the convergence properties of Adam. We show that Adam attains non-asymptotic sample complexity bounds similar to those of SGD.
Score: 28.905886549938305
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adaptive Moment Estimation (Adam) is a cornerstone optimization algorithm in deep learning, widely recognized for its flexibility with adaptive learning rates and efficiency in handling large-scale data. However, despite its practical success, the theoretical understanding of Adam's convergence has been constrained by stringent assumptions, such as almost surely bounded stochastic gradients or uniformly bounded gradients, which are more restrictive than those typically required for analyzing stochastic gradient descent (SGD). In this paper, we introduce a novel and comprehensive framework for analyzing the convergence properties of Adam. This framework offers a versatile approach to establishing Adam's convergence. Specifically, we prove that Adam achieves asymptotic (last iterate sense) convergence in both the almost sure sense and the $L_1$ sense under the relaxed assumptions typically used for SGD, namely $L$-smoothness and the ABC inequality. Meanwhile, under the same assumptions, we show that Adam attains non-asymptotic sample complexity bounds similar to those of SGD.
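For readers unfamiliar with the update rule whose convergence is being analyzed, the following is a minimal sketch of the standard Adam iteration in plain Python. It is not taken from the paper: the function names (`adam_minimize`, `quadratic_grad`), the hyperparameter values, the diminishing step size `lr / sqrt(t)`, and the toy objective $f(x) = \frac{1}{2}\|x\|^2$ (which is $L$-smooth with $L = 1$) are all illustrative assumptions, and the gradients here are deterministic rather than stochastic.

```python
import math


def adam_minimize(grad, x0, steps=2000, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run the standard Adam update on a list of floats and return the last iterate.

    Hypothetical helper for illustration; hyperparameters are assumptions.
    """
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (mean) estimate of the gradient
    v = [0.0] * len(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        step = lr / math.sqrt(t)  # diminishing step size (an assumption)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for m
            v_hat = v[i] / (1 - beta2 ** t)  # bias correction for v
            x[i] -= step * m_hat / (math.sqrt(v_hat) + eps)
    return x


def quadratic_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * sum(x_i^2).
    return list(x)


x_last = adam_minimize(quadratic_grad, [3.0, -2.0])
# Gradient norm at the last iterate, the quantity last-iterate results track.
grad_norm = math.sqrt(sum(g * g for g in quadratic_grad(x_last)))
```

The sketch reports the gradient norm at the final iterate rather than an average over iterates, which mirrors the "last iterate" notion of convergence mentioned in the abstract. It only illustrates the algorithm and says nothing about the paper's proof technique.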

Related papers

On the Convergence of Adam under Non-uniform Smoothness: Separability from SGDM and Beyond [35.65852208995095]
We demonstrate that Adam achieves faster convergence than SGDM under the condition of non-uniformly bounded smoothness. Our findings reveal that: (1) in deterministic environments, Adam can attain the known lower bound for the convergence rate of deterministic first-order optimizers, whereas the convergence rate of Gradient Descent with Momentum (GDM) has higher order dependence on the initial function value.
arXiv Detail & Related papers (2024-03-22T11:57:51Z)
High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise [4.9495085874952895]
We show that Adam could converge to the stationary point in high probability with a rate of $\mathcal{O}\left(\mathrm{poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" noise variance. It is also revealed that Adam's confines within an order of $\mathcal{O}\left(\mathrm{poly}(\log T)\right)$ are adaptive to the noise level.
arXiv Detail & Related papers (2023-11-03T15:55:53Z)
UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization [20.399244578926474]
We introduce a unified framework for Adam-type algorithms (called UAdam). It is equipped with a general form of the second-order moment, which covers variants such as NAdam, AdaBound, AdaFom, and Adan. We show that UAdam converges to the neighborhood of stationary points with the rate of $\mathcal{O}(1/T)$.
arXiv Detail & Related papers (2023-05-09T13:07:03Z)
Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions. We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z)
From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. The overarching theme of this paper is providing general conditions under which SGD converges, assuming that Gradient Flow (GF) on the population loss converges. We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix sq-root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z)
Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence. Existing convergence analyses for Adam rely on the bounded smoothness assumption. This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z)
Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z)
A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc. Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non- compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z)
Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO. We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size. Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z)
Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks. Existing approaches, such as decreasing the adaptive learning rate or adopting a big batch size, have tried to make Adam-type algorithms converge. We introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the historical base learning rate.
arXiv Detail & Related papers (2021-01-14T06:42:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.