Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and
Relaxed Assumptions
- URL: http://arxiv.org/abs/2305.18471v2
- Date: Thu, 28 Sep 2023 08:58:09 GMT
- Title: Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and
Relaxed Assumptions
- Authors: Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen
- Abstract summary: A new auxiliary function $\xi$ helps eliminate the complexity of handling the correlation between the numerator and denominator of AdaGrad's iterations.
We show that AdaGrad converges under the $(L_0,L_1)$-smooth condition as long as the learning rate is lower than a threshold.
- Score: 33.49807210207286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We provide a simple convergence proof for AdaGrad optimizing non-convex
objectives under only affine noise variance and bounded smoothness assumptions.
The proof is essentially based on a novel auxiliary function $\xi$ that helps
eliminate the complexity of handling the correlation between the numerator and
denominator of AdaGrad's update. Leveraging simple proofs, we are able to
obtain tighter results than existing ones \citep{faw2022power} and extend
the analysis to several new and important cases. Specifically, for the
over-parameterized regime, we show that AdaGrad needs only
$\mathcal{O}(\frac{1}{\varepsilon^2})$ iterations to ensure the gradient norm is
smaller than $\varepsilon$, which matches the rate of SGD and is significantly
tighter than the existing $\mathcal{O}(\frac{1}{\varepsilon^4})$ rate for AdaGrad.
We then discard the bounded smoothness assumption and consider a realistic
assumption on smoothness called $(L_0,L_1)$-smooth condition, which allows
local smoothness to grow with the gradient norm. Again based on the auxiliary
function $\xi$, we prove that AdaGrad succeeds in converging under
$(L_0,L_1)$-smooth condition as long as the learning rate is lower than a
threshold. Interestingly, we further show that the requirement on learning rate
under the $(L_0,L_1)$-smooth condition is necessary via proof by contradiction,
in contrast with the case of uniform smoothness conditions where convergence is
guaranteed regardless of learning rate choices. Together, our analyses broaden
the understanding of AdaGrad and demonstrate the power of the new auxiliary
function in the investigations of AdaGrad.
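
To make the object of these analyses concrete, below is a minimal sketch, assuming the scalar-stepsize (norm) variant of AdaGrad on a toy quadratic objective; the objective, base step size $\eta$, accumulator initialization, and tolerance $\varepsilon$ are illustrative assumptions rather than choices from the paper, and the coordinate-wise AdaGrad covered by some of the results differs in detail. The running sum of squared gradient norms in the denominator is the quantity whose correlation with the numerator (the current gradient) the auxiliary function $\xi$ is designed to disentangle.

# Illustrative sketch only: deterministic AdaGrad-Norm on a toy quadratic,
# stopped once the gradient norm drops below a target epsilon. The objective
# and all hyperparameters are arbitrary choices for demonstration, not the
# paper's setting.
import numpy as np

def grad(x):
    # Gradient of f(x) = 0.5 * ||x||^2 (a stand-in smooth objective).
    return x

def adagrad_norm(x0, eta=1.0, eps0=1e-8, epsilon=1e-3, max_iter=100_000):
    """x_{t+1} = x_t - eta * g_t / sqrt(eps0 + sum_{s<=t} ||g_s||^2)."""
    x = np.asarray(x0, dtype=float)
    accum = eps0  # running sum of squared gradient norms (the "denominator")
    for t in range(1, max_iter + 1):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= epsilon:  # epsilon-stationarity reached
            return x, t
        accum += gnorm ** 2
        x = x - eta * g / np.sqrt(accum)
    return x, max_iter

x_final, iters = adagrad_norm(np.ones(10))
print(f"||grad|| <= 1e-3 after {iters} iterations")

For reference, the $(L_0,L_1)$-smooth condition is commonly stated as $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, so the effective smoothness along the trajectory may grow with the gradient norm, which is consistent with the abstract's observation that the learning rate must stay below a threshold in this regime.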
Related papers
- Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - A Note on Quantum Phase Estimation [0.0]
We show an alternative, simpler and self-contained proof of query lower bounds.
Specifically, our proof uses only basic linear algebra, without relying on Boolean function analysis or adversary methods.
arXiv Detail & Related papers (2023-04-05T06:05:42Z) - High Probability Convergence of Stochastic Gradient Methods [15.829413808059124]
We show convergence with bounds depending on the initial distance to the optimal solution.
We demonstrate that our techniques can be used to obtain high-probability bounds for AdaGrad-Norm.
arXiv Detail & Related papers (2023-02-28T18:42:11Z) - Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD [38.221784575853796]
This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant, using a stochastic gradient oracle.
We develop a technique that allows us to prove $\mathcal{O}\big(\frac{\mathrm{polylog}(T)\,\sigma_t}{T}\big)$ convergence rates without assuming uniform bounds on the noise.
arXiv Detail & Related papers (2023-02-13T18:13:36Z) - Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly non-convex functions.
We generalize the PAGE algorithm so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of optimal bounds in the smooth non-convex regime.
arXiv Detail & Related papers (2022-06-05T21:32:33Z) - High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad
Stepsize [55.0090961425708]
We propose a new, simplified high-probability analysis of AdaGrad for smooth, non-convex problems.
We present our analysis in a modular way and obtain a complementary $\mathcal{O}(1/T)$ convergence rate in the deterministic setting.
To the best of our knowledge, this is the first high-probability bound for AdaGrad with a truly adaptive scheme, i.e., one completely oblivious to the knowledge of smoothness.
arXiv Detail & Related papers (2022-04-06T13:50:33Z) - A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
We consider the problem of finding a saddle point for the convex-concave objective $\min_x \max_y f(x) + \langle Ax, y\rangle - g^*(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth.
We propose an adaptive version of the Condat-Vu algorithm, which alternates between primal gradient steps and dual proximal steps.
arXiv Detail & Related papers (2021-10-28T14:19:30Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - Fast Dimension Independent Private AdaGrad on Publicly Estimated
Subspaces [24.52697154861819]
We show that noisy AdaGrad achieves a regret comparable to traditional AdaGrad plus a well-controlled term due to noise.
We operate with general convex functions in both constrained and unconstrained minimization.
arXiv Detail & Related papers (2020-08-14T20:46:38Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)