High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad
Stepsize
- URL: http://arxiv.org/abs/2204.02833v1
- Date: Wed, 6 Apr 2022 13:50:33 GMT
- Title: High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad
Stepsize
- Authors: Ali Kavis, Kfir Yehuda Levy, Volkan Cevher
- Abstract summary: We propose a new, simplified high probability analysis of AdaGrad for smooth, non-convex problems.
We present our analysis in a modular way and obtain a complementary $\mathcal O (1 / T)$ convergence rate in the deterministic setting.
To the best of our knowledge, this is the first high probability result for AdaGrad with a truly adaptive scheme, i.e., completely oblivious to the knowledge of smoothness.
- Score: 55.0090961425708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new, simplified high probability analysis of
AdaGrad for smooth, non-convex problems. More specifically, we focus on a
particular accelerated gradient (AGD) template (Lan, 2020), through which we
recover the original AdaGrad and its variant with averaging, and prove a
convergence rate of $\mathcal O (1/ \sqrt{T})$ with high probability without
the knowledge of smoothness and variance. We use a particular version of
Freedman's concentration bound for martingale difference sequences (Kakade &
Tewari, 2008) which enables us to achieve the best-known dependence of $\log (1
/ \delta )$ on the probability margin $\delta$. We present our analysis in a
modular way and obtain a complementary $\mathcal O (1 / T)$ convergence rate in
the deterministic setting. To the best of our knowledge, this is the first high
probability result for AdaGrad with a truly adaptive scheme, i.e., completely
oblivious to the knowledge of smoothness and uniform variance bound, which
simultaneously has the best-known dependence of $\log( 1/ \delta)$. We further
prove the noise adaptation property of AdaGrad under additional noise assumptions.
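For concreteness, the stepsize in question is the AdaGrad(-Norm) rule, which divides a base learning rate by the square root of the running sum of squared gradient norms and therefore needs neither the smoothness constant nor a variance bound. Below is a minimal sketch in Python, assuming a hypothetical stochastic gradient oracle `grad_fn` and a toy quadratic test problem; it illustrates the generic AdaGrad-Norm update, not the paper's exact accelerated (AGD) template.
```python
import numpy as np

def adagrad_norm(grad_fn, x0, b0=1e-2, eta=1.0, T=1000, rng=None):
    """AdaGrad-Norm sketch: one scalar stepsize driven by the accumulated
    squared gradient norms, oblivious to smoothness and noise variance."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    acc = b0 ** 2                        # b_0^2: small positive initialization
    for _ in range(T):
        g = grad_fn(x, rng)              # stochastic gradient oracle
        acc += np.dot(g, g)              # accumulate ||g_t||^2
        x = x - eta / np.sqrt(acc) * g   # x_{t+1} = x_t - eta * g_t / sqrt(sum_s ||g_s||^2)
    return x

# Hypothetical test problem: noisy gradients of f(x) = 0.5 * ||A x||^2.
A = np.diag([1.0, 10.0])
noisy_grad = lambda x, rng: A.T @ (A @ x) + 0.1 * rng.standard_normal(x.shape)
print(adagrad_norm(noisy_grad, x0=np.array([5.0, -3.0])))
```
The single scalar accumulator `acc` is what makes the rule adaptive; the paper recovers the original AdaGrad and its variant with averaging from one shared template.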
Related papers
- Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped stochastic optimization methods under heavy-tailed noise with structured density (a minimal clipped-SGD sketch appears after this list).
We show that it is possible to get faster rates of convergence than $\mathcal O(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z)
- High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise [59.25598762373543]
We establish high-probability convergence guarantees for learning on streaming data in the presence of heavy-tailed noise.
We demonstrate analytically how to identify the preferred choice of setting for a given problem.
arXiv Detail & Related papers (2023-10-28T18:53:41Z)
- Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions [33.49807210207286]
A new auxiliary function $\xi$ helps eliminate the complexity of handling the correlation between the numerator and denominator of AdaGrad's iterates.
We show that AdaGrad succeeds in converging under the $(L_0, L_1)$-smoothness condition as long as the learning rate is lower than a threshold.
arXiv Detail & Related papers (2023-05-29T09:33:04Z)
- High Probability Convergence of Stochastic Gradient Methods [15.829413808059124]
We show convergence with bounds depending on the initial distance to the optimal solution.
We demonstrate that our techniques can be used to obtain high probability bounds for AdaGrad-Norm.
arXiv Detail & Related papers (2023-02-28T18:42:11Z)
- SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance [33.593203156666746]
We study SGD with AdaGrad stepsizes, a popular adaptive (self-tuning) method for first-order optimization.
We find sharp rates of convergence in both the low-noise and high-noise regimes.
arXiv Detail & Related papers (2023-02-17T09:46:08Z)
- Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization [52.25843977506935]
We propose an adaptive variance-reduction method, called AdaSpider, for $L$-smooth, non-convex functions with a finite-sum structure.
In doing so, we are able to compute an $\epsilon$-stationary point with $\tilde{\mathcal O}\left(n + \sqrt{n}/\epsilon^2\right)$ oracle calls.
arXiv Detail & Related papers (2022-11-03T14:41:46Z)
- Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z)
- On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a wide class of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
- A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We show a simple proof of convergence for the Adam and Adagrad algorithms with a rate of $O(d \ln(N)/\sqrt{N})$.
Adam converges with the same $O(d \ln(N)/\sqrt{N})$ rate when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
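Returning to the clipped-methods entry above (Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems): the basic ingredient is gradient clipping, which rescales any stochastic gradient whose norm exceeds a threshold before the SGD step, taming heavy-tailed noise. A minimal clipped-SGD sketch in Python follows, assuming a hypothetical gradient oracle and a Student-t noise model with infinite variance; it illustrates the generic technique, not that paper's specific estimator or rates.
```python
import numpy as np

def clipped_sgd(grad_fn, x0, lr=0.1, lam=1.0, T=1000, rng=None):
    """Clipped SGD sketch: rescale any stochastic gradient whose norm
    exceeds the clipping level lam before taking the usual SGD step."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g = grad_fn(x, rng)
        norm = np.linalg.norm(g)
        if norm > lam:
            g = g * (lam / norm)   # clip: min(1, lam/||g||) * g
        x = x - lr * g
    return x

# Hypothetical problem: quadratic with heavy-tailed (Student-t, df=2) gradient noise.
grad = lambda x, rng: 2.0 * x + rng.standard_t(df=2.0, size=x.shape)
print(clipped_sgd(grad, x0=np.array([3.0, -2.0])))
```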