Adam Can Converge Without Any Modification on Update Rules
- URL: http://arxiv.org/abs/2208.09632v2
- Date: Tue, 23 Aug 2022 16:10:44 GMT
- Title: Adam Can Converge Without Any Modification on Update Rules
- Authors: Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, Zhi-Quan Luo
- Abstract summary: vanilla Adam remains exceptionally popular and it works well in practice.
We prove that when $\beta_2$ is large, Adam converges to the neighborhood of critical points.
Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence when increasing $\beta_2$.
- Score: 24.575453562687095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many
new variants have been designed to obtain convergence. However, vanilla Adam
remains exceptionally popular and it works well in practice. Why is there a gap
between theory and practice? We point out there is a mismatch between the
settings of theory and practice: Reddi et al. 2018 pick the problem after
picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$; while
practical applications often fix the problem first and then tune $(\beta_1,
\beta_2)$. Based on this observation, we conjecture that the empirical
convergence can be theoretically justified only if we change the order of
picking the problem and the hyperparameters. In this work, we confirm this
conjecture. We prove that, when $\beta_2$ is large and $\beta_1 <
\sqrt{\beta_2}<1$, Adam converges to the neighborhood of critical points. The
size of the neighborhood is proportional to the variance of stochastic
gradients. Under an extra condition (strong growth condition), Adam converges
to critical points. As $\beta_2$ increases, our convergence result can cover
any $\beta_1 \in [0,1)$ including $\beta_1=0.9$, which is the default setting
in deep learning libraries. To our knowledge, this is the first result showing
that Adam can converge under a wide range of hyperparameters *without any
modification* on its update rules. Further, our analysis does not require
assumptions of bounded gradients or bounded 2nd-order momentum. When $\beta_2$
is small, we further point out a large region of $(\beta_1,\beta_2)$ where Adam
can diverge to infinity. Our divergence result considers the same setting as
our convergence result, indicating a phase transition from divergence to
convergence when increasing $\beta_2$. These positive and negative results can
provide suggestions on how to tune Adam hyperparameters.
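To make the hyperparameter condition concrete, here is a minimal Python sketch (not the authors' code; the helper names `in_convergence_regime` and `adam_step` and the toy noisy quadratic are illustrative assumptions) that implements the unmodified Adam update and checks whether a given $(\beta_1, \beta_2)$ pair satisfies the sufficient condition $\beta_1 < \sqrt{\beta_2} < 1$ stated in the abstract.
```python
# Illustrative sketch only (not the paper's code): a vanilla Adam update plus a
# check of the convergence condition beta1 < sqrt(beta2) < 1 from the abstract.
import math
import numpy as np


def in_convergence_regime(beta1: float, beta2: float) -> bool:
    """Check the sufficient condition beta1 < sqrt(beta2) < 1."""
    return beta1 < math.sqrt(beta2) < 1.0


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One vanilla Adam step with bias correction; the update rule is unmodified."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


if __name__ == "__main__":
    beta1, beta2 = 0.9, 0.999                    # common deep-learning defaults
    print(in_convergence_regime(beta1, beta2))   # True, since sqrt(0.999) ~ 0.9995 > 0.9

    # Toy stochastic quadratic: f(theta) = 0.5 * ||theta||^2 with additive gradient noise.
    rng = np.random.default_rng(0)
    theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    for t in range(1, 2001):
        grad = theta + 0.1 * rng.standard_normal(3)   # stochastic gradient
        theta, m, v = adam_step(theta, grad, m, v, t, beta1=beta1, beta2=beta2)
    # With large beta2 the iterates should settle near the critical point theta = 0,
    # in a neighborhood whose size scales with the gradient-noise variance.
    print(np.linalg.norm(theta))
```
With the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, the check passes because $\sqrt{0.999} \approx 0.9995 > 0.9$; a small $\beta_2$ (e.g., $\beta_2 = 0.5$ with $\beta_1 = 0.9$) fails the check and falls outside the proven convergence regime, where, per the abstract, divergence becomes possible.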
Related papers
- ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal Rate [21.378608502899077]
We propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$ with any $\beta_2$, without depending on the bounded noise assumption.
Our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning.
arXiv Detail & Related papers (2024-11-05T06:57:47Z) - Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance [23.112775335244258]
We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum.
We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm.
Our results for both RMSProp and Adam match the complexity established in \cite{arvani2023lower}.
arXiv Detail & Related papers (2024-04-01T19:17:45Z) - Closing the Gap Between the Upper Bound and the Lower Bound of Adam's Iteration Complexity [51.96093077151991]
We derive a new convergence guarantee of Adam, with only an $L$-smooth condition and a bounded noise variance assumption.
Our proof utilizes novel techniques to handle the entanglement between momentum and adaptive learning rate.
arXiv Detail & Related papers (2023-10-27T09:16:58Z) - The Sample Complexity of Online Contract Design [120.9833763323407]
We study the hidden-action principal-agent problem in an online setting.
In each round, the principal posts a contract that specifies the payment to the agent based on each outcome.
The agent then makes a strategic choice of action that maximizes her own utility, but the action is not directly observable by the principal.
arXiv Detail & Related papers (2022-11-10T17:59:42Z) - Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate of the optimal action-value function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - A new regret analysis for Adam-type algorithms [78.825194932103]
In theory, regret guarantees for online convex optimization require a rapidly decaying $\beta_1 \to 0$ schedule.
We propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $\beta_1$.
arXiv Detail & Related papers (2020-03-21T19:19:51Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We provide a simple proof of convergence, at rate $O(d\ln(N)/\sqrt{N})$, covering both the Adam and Adagrad adaptive optimization algorithms.
Adam achieves the same $O(d\ln(N)/\sqrt{N})$ rate when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.