Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD
- URL: http://arxiv.org/abs/2302.06570v1
- Date: Mon, 13 Feb 2023 18:13:36 GMT
- Title: Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD
- Authors: Matthew Faw, Litu Rout, Constantine Caramanis, Sanjay Shakkottai
- Abstract summary: This work considers the problem of finding a first-order stationary point of a non-convex function with potentially unbounded smoothness constant using a stochastic gradient oracle.
We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates without assuming uniform bounds on the noise support.
- Score: 38.221784575853796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work considers the problem of finding a first-order stationary point of
a non-convex function with potentially unbounded smoothness constant using a
stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth
functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that
these functions more closely capture practical machine learning problems as
compared to the pervasive $L_0$-smoothness. This class is rich enough to
include highly non-smooth functions, such as $\exp(L_1 x)$ which is
$(0,\mathcal{O}(L_1))$-smooth. Despite the richness, an emerging line of works
achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence
when the noise of the stochastic gradients is deterministically and uniformly
bounded. This noise restriction is not required in the $L_0$-smooth setting,
and in many practical settings is either not satisfied, or results in weaker
convergence rates with respect to the noise scaling of the convergence rate.
We develop a technique that allows us to prove
$\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for
$(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise
support. The key innovation behind our results is a carefully constructed
stopping time $\tau$ which is simultaneously "large" on average, yet also
allows us to treat the adaptive step sizes before $\tau$ as (roughly)
independent of the gradients. For general $(L_0,L_1)$-smooth functions, our
analysis requires the mild restriction that the multiplicative noise parameter
$\sigma_1 < 1$. For a broad subclass of $(L_0,L_1)$-smooth functions, our
convergence rate continues to hold when $\sigma_1 \geq 1$. By contrast, we
prove that many algorithms analyzed by prior works on $(L_0,L_1)$-smooth
optimization diverge with constant probability even for smooth and
strongly-convex functions when $\sigma_1 > 1$.
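As a concrete reference point, here is a minimal sketch of AdaGrad-Norm-style adaptive SGD, the kind of adaptive step-size rule analyzed in this line of work; the function names, the defaults for $\eta$ and $b_0$, and whether the current gradient enters the step-size accumulator are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of AdaGrad-Norm-style adaptive SGD (illustrative assumptions;
# the exact variant analyzed in the paper may differ).
import numpy as np

def adagrad_norm(grad_oracle, x0, eta=1.0, b0=1.0, T=1000):
    """Run T steps of AdaGrad-Norm starting from x0.

    grad_oracle(x) returns a stochastic gradient g with E[g] = grad f(x).
    """
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2                            # accumulator b_t^2
    for _ in range(T):
        g = grad_oracle(x)
        b_sq += float(np.dot(g, g))           # b_t^2 = b_{t-1}^2 + ||g_t||^2
        x = x - (eta / np.sqrt(b_sq)) * g     # x_{t+1} = x_t - (eta / b_t) g_t
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    oracle = lambda x: x + 0.1 * rng.standard_normal(x.shape)
    print(adagrad_norm(oracle, x0=np.ones(5)))
```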
Related papers
- An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - High Probability Convergence of Stochastic Gradient Methods [15.829413808059124]
We show convergence with bounds depending on the initial distance to the optimal solution.
We demonstrate that our techniques can be used to obtain high-probability bounds for AdaGrad-Norm.
arXiv Detail & Related papers (2023-02-28T18:42:11Z) - Breaking the Lower Bound with (Little) Structure: Acceleration in
Non-Convex Stochastic Optimization with Heavy-Tailed Noise [28.780192812703948]
We consider the optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime.
We show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure.
We also establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)\,T^{\frac{1-p}{3p-2}})$ under a mild condition.
arXiv Detail & Related papers (2023-02-14T00:23:42Z) - Near-Optimal Non-Convex Stochastic Optimization under Generalized
Smoothness [21.865728815935665]
Two recent works established the $O(\epsilon^{-3})$ sample complexity to obtain an $O(\epsilon)$-stationary point.
However, both require a large batch size on the order of $\mathrm{poly}(\epsilon^{-1})$, which is not only computationally burdensome but also unsuitable for streaming applications.
In this work, we solve the prior two problems simultaneously by revisiting a simple variant of the STORM algorithm (a generic sketch of STORM appears after this list).
arXiv Detail & Related papers (2023-02-13T00:22:28Z) - Randomized Block-Coordinate Optimistic Gradient Algorithms for
Root-Finding Problems [8.0153031008486]
We develop two new algorithms to approximate a solution of nonlinear equations in large-scale settings.
We apply our methods to a class of large-scale finite-sum inclusions, which covers prominent applications in machine learning.
arXiv Detail & Related papers (2023-01-08T21:46:27Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We show a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Optimal Gradient Sliding and its Application to Distributed Optimization
Under Similarity [121.83085611327654]
We study structured convex optimization problems with additive objective $r := p + q$, where $r$ is $\mu$-strongly convex.
We propose a method to solve this problem that controls both master-to-agents communication and local gradient calls.
The proposed method is much sharper than the $\mathcal{O}(\sqrt{L_q/\mu})$ method.
arXiv Detail & Related papers (2022-05-30T14:28:02Z) - A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
We consider the problem of finding a saddle point for the convex-concave objective $\min_x \max_y f(x) + \langle Ax, y\rangle - g^*(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth.
We propose an adaptive version of the Condat-Vu algorithm, which alternates between primal gradient steps and dual steps.
arXiv Detail & Related papers (2021-10-28T14:19:30Z) - Accelerating Optimization and Reinforcement Learning with
Quasi-Stochastic Approximation [2.294014185517203]
This paper sets out to extend convergence theory to quasi-stochastic approximations.
It is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-09-30T04:44:45Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)
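For context on the STORM algorithm mentioned in the entry on near-optimal non-convex stochastic optimization above, here is a generic sketch of the vanilla recursive-momentum estimator of Cutkosky and Orabona; the constant momentum $a$, the constant step size $\eta$, and the seeded oracle interface are simplifying assumptions and do not reflect the specific variant revisited in that paper.

```python
# Generic sketch of the vanilla STORM recursive-momentum estimator
# (constant momentum and step size are simplifying assumptions).
import numpy as np

def storm(grad_oracle, x0, eta=0.1, a=0.1, T=1000):
    """grad_oracle(x, seed) returns a stochastic gradient of f at x using the
    randomness indexed by `seed`, so the same sample can be evaluated at both
    x_t and x_{t-1}, as STORM's recursive correction requires."""
    x_prev = np.asarray(x0, dtype=float)
    d = grad_oracle(x_prev, 0)               # d_1 = g(x_1; xi_1)
    x = x_prev - eta * d
    for t in range(1, T):
        g_new = grad_oracle(x, t)            # g(x_t; xi_t)
        g_old = grad_oracle(x_prev, t)       # g(x_{t-1}; xi_t), same sample
        d = g_new + (1.0 - a) * (d - g_old)  # recursive momentum correction
        x_prev, x = x, x - eta * d
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2 with seeded noise.
if __name__ == "__main__":
    def oracle(x, seed):
        rng = np.random.default_rng(seed)
        return x + 0.1 * rng.standard_normal(x.shape)
    print(storm(oracle, x0=np.ones(5)))
```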