The Convergence Rate of SGD's Final Iterate: Analysis on Dimension
Dependence
- URL: http://arxiv.org/abs/2106.14588v1
- Date: Mon, 28 Jun 2021 11:51:04 GMT
- Title: The Convergence Rate of SGD's Final Iterate: Analysis on Dimension
Dependence
- Authors: Daogao Liu, Zhou Lu
- Abstract summary: Gradient Descent (SGD) is among the simplest and most popular methods in optimization.
We show how to characterize the final-iterate convergence of SGD in the constant dimension setting.
- Score: 2.512827436728378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Gradient Descent (SGD) is among the simplest and most popular
methods in optimization. The convergence rate for SGD has been extensively
studied and tight analyses have been established for the running average
scheme, but the sub-optimality of the final iterate is still not
well-understood. shamir2013stochastic gave the best known upper bound for the
final iterate of SGD minimizing non-smooth convex functions, which is $O(\log
T/\sqrt{T})$ for Lipschitz convex functions and $O(\log T/T)$ under an additional
strong convexity assumption. The best known lower bounds, however, are
worse than the upper bounds by a factor of $\log T$. harvey2019tight gave
matching lower bounds but their construction requires dimension $d= T$. It was
then asked by koren2020open how to characterize the final-iterate convergence
of SGD in the constant dimension setting.
In this paper, we answer this question in the more general setting for any
$d\leq T$, proving $\Omega(\log d/\sqrt{T})$ and $\Omega(\log d/T)$ lower
bounds for the sub-optimality of the final iterate of SGD in minimizing
non-smooth Lipschitz convex and strongly convex functions respectively with
standard step size schedules. Our results provide the first general
dimension-dependent lower bound on the convergence of SGD's final iterate, partially
resolving a COLT open question raised by koren2020open. We also present further
evidence to show the correct rate in one dimension should be
$\Theta(1/\sqrt{T})$, such as a proof of a tight $O(1/\sqrt{T})$ upper bound
for one-dimensional special cases in settings more general than koren2020open.
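As a rough numerical illustration of the objects discussed in the abstract (not taken from the paper), the sketch below runs SGD on the one-dimensional Lipschitz convex function $f(x) = |x|$ with the standard step size $\eta_t = c/\sqrt{t}$ and compares the sub-optimality of the final iterate with that of the running average; the noise model, the constant $c$, and the horizon $T$ are illustrative choices.

```python
# Minimal sketch (illustrative, not from the paper): final iterate vs. running average
# of SGD on the 1-D Lipschitz convex function f(x) = |x|, whose minimum value is 0,
# using the standard step-size schedule eta_t = c / sqrt(t) for Lipschitz convex f.
import math
import random

def noisy_subgradient(x, noise=0.1):
    """Subgradient of f(x) = |x| plus zero-mean Gaussian noise (stochastic oracle)."""
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    return g + random.gauss(0.0, noise)

def sgd_final_vs_average(T=10_000, c=1.0, x0=1.0, seed=0):
    random.seed(seed)
    x, running_sum = x0, 0.0
    for t in range(1, T + 1):
        eta = c / math.sqrt(t)                 # standard step size for the convex case
        x -= eta * noisy_subgradient(x)
        running_sum += x
    # Sub-optimality f(x) - min f = |x| for the final iterate and the running average.
    return abs(x), abs(running_sum / T)

if __name__ == "__main__":
    final_gap, avg_gap = sgd_final_vs_average()
    print(f"final iterate sub-optimality:   {final_gap:.4f}")  # O(log T / sqrt(T)) upper bound
    print(f"running average sub-optimality: {avg_gap:.4f}")    # O(1 / sqrt(T)) rate
```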
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|_1\big] \leq O\big(\frac{\sqrt{d}}{T^{1/4}}\big)$.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered analogous to the corresponding $\frac{1}{T}\sum_{k=1}^{T}$-averaged rate of SGD.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods [25.831462008050387]
The Stochastic Gradient Descent (SGD) algorithm has attracted interest due to its good performance in practice despite a lack of theoretical understanding.
It remains unclear whether last-iterate convergence can be provably extended to broader composite optimization and non-Euclidean norms.
arXiv Detail & Related papers (2023-12-13T21:41:06Z) - Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications [2.0584253077707477]
We study the convergence properties of the Stochastic Gradient Descent (SGD) method for finding a stationary point of an objective function $J(\cdot)$.
Our results apply to a class of "invex" functions, which have the property that every stationary point is also a global minimizer.
arXiv Detail & Related papers (2023-12-05T15:22:39Z) - Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex
Optimization [9.019243171993553]
The number of training steps $T$ and the step size $\eta$ can affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show better rates can be attained but the improvement is reduced when training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + \big(\log(d)/(\varepsilon n)\big)^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits the inputs and outputs perfectly, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an error of order $\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error of a learning model, and a similar guarantee is proved for further classes of loss functions.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ under standard step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence covering both the Adam and Adagrad algorithms, with a rate of order $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Closing the convergence gap of SGD without replacement [9.109788577327503]
Stochastic gradient descent without replacement sampling is widely used in practice for model training.
We show that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic (a minimal sampling-scheme sketch follows this list).
arXiv Detail & Related papers (2020-02-24T17:37:28Z)
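As a rough illustration of the sampling schemes compared in the last entry above (not the paper's analysis), the sketch below runs SGD with and without replacement on a finite sum of one-dimensional quadratics, whose sum is itself a quadratic; the step size, data, and epoch count are arbitrary illustrative choices.

```python
# Minimal sketch (illustrative): SGD without replacement (random reshuffling each epoch)
# versus SGD with replacement (i.i.d. sampling) on f_i(x) = 0.5 * (x - a_i)^2,
# so the averaged objective is a quadratic with minimizer mean(a).
import random

def sgd_epochs(a, epochs=50, eta=0.05, x0=0.0, with_replacement=False, seed=0):
    rng = random.Random(seed)
    n, x = len(a), x0
    for _ in range(epochs):
        if with_replacement:
            order = [rng.randrange(n) for _ in range(n)]  # i.i.d. index sampling
        else:
            order = rng.sample(range(n), n)               # one pass over a random permutation
        for i in order:
            x -= eta * (x - a[i])                         # gradient of 0.5 * (x - a_i)^2
    return x

if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    x_star = sum(data) / len(data)                        # minimizer of the averaged quadratic
    for flag in (False, True):
        x_T = sgd_epochs(data, with_replacement=flag)
        label = "with replacement   " if flag else "without replacement"
        print(f"{label}: |x_T - x*| = {abs(x_T - x_star):.6f}")
```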