A Simple Convergence Proof of Adam and Adagrad
- URL: http://arxiv.org/abs/2003.02395v3
- Date: Mon, 17 Oct 2022 13:20:40 GMT
- Title: A Simple Convergence Proof of Adam and Adagrad
- Authors: Alexandre D\'efossez, L\'eon Bottou, Francis Bach, Nicolas Usunier
- Abstract summary: We show a proof of convergence between the Adam Adagrad and $O(d(N)/st)$ algorithms.
Adam converges with the same convergence $O(d(N)/st)$ when used with the default parameters.
- Score: 74.24716715922759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide a simple proof of convergence covering both the Adam and Adagrad
adaptive optimization algorithms when applied to smooth (possibly non-convex)
objective functions with bounded gradients. We show that in expectation, the
squared norm of the objective gradient averaged over the trajectory has an
upper-bound which is explicit in the constants of the problem, parameters of
the optimizer, the dimension $d$, and the total number of iterations $N$. This
bound can be made arbitrarily small, and with the right hyper-parameters, Adam
can be shown to converge with the same rate of convergence
$O(d\ln(N)/\sqrt{N})$. When used with the default parameters, Adam doesn't
converge, however, and just like constant step-size SGD, it moves away from the
initialization point faster than Adagrad, which might explain its practical
success. Finally, we obtain the tightest dependency on the heavy ball momentum
decay rate $\beta_1$ among all previous convergence bounds for non-convex Adam
and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$.
Related papers
- Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance [23.112775335244258]
We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum.
We develop a new upper bound first-order term in the descent lemma, which is also a function of the gradient norm.
Our results for both RMSProp and Adam match with the complexity established in citearvani2023lower.
arXiv Detail & Related papers (2024-04-01T19:17:45Z) - Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods [25.831462008050387]
The Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding.
It still remains unclear whether the lastiterate convergence can be provably extended to wider composite optimization and non-Euclidean norms.
arXiv Detail & Related papers (2023-12-13T21:41:06Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $epsilon$-stationary points with $O(epsilon-4)$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(epsilon-3)$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - High Probability Convergence of Stochastic Gradient Methods [15.829413808059124]
We show convergence with bounds depending on the initial distance to the optimal solution.
We demonstrate that our techniques can be used to obtain high bound for AdaGrad-Norm.
arXiv Detail & Related papers (2023-02-28T18:42:11Z) - Over-Parameterization Exponentially Slows Down Gradient Descent for
Learning a Single Neuron [49.45105570960104]
We prove the global convergence of randomly gradient descent with a $Oleft(T-3right)$ rate.
These two bounds jointly give an exact characterization of the convergence rate.
We show this potential function converges slowly, which implies the slow convergence rate of the loss function.
arXiv Detail & Related papers (2023-02-20T15:33:26Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We show a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $mathcalO!!!(sigma2-2_avg!! .
We also show that the heterogeneity term in rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Proximal Gradient Descent-Ascent: Variable Convergence under K{\L}
Geometry [49.65455534654459]
The finite descent-ascent parameters (GDA) has been widely applied to solve minimax optimization problems.
This paper fills such a gap by studying the convergence of the KL-Lized geometry.
arXiv Detail & Related papers (2021-02-09T05:35:53Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits perfectly inputs and outputs $langle theta_*, phi(X) rangle = Y$, where $phi(X)$ stands for a possibly infinite dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of gradient descent (SGD)
We show that SGD avoids saddle points/manifolds with $1$ for strict step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.