Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods
- URL: http://arxiv.org/abs/2312.08531v2
- Date: Tue, 12 Mar 2024 03:53:46 GMT
- Title: Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods
- Authors: Zijian Liu, Zhengyuan Zhou
- Abstract summary: The last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted interest due to its good performance in practice but limited theoretical understanding.
It remains unclear whether last-iterate convergence can be provably extended to composite optimization and non-Euclidean norms.
- Score: 25.831462008050387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the past several years, the last-iterate convergence of the Stochastic
Gradient Descent (SGD) algorithm has attracted considerable interest due to its
good performance in practice but a lack of theoretical understanding. For
Lipschitz convex functions, different works have established the optimal
$O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$
high-probability convergence rates for the final iterate, where $T$ is the time
horizon and $\delta$ is the failure probability. However, to prove these
bounds, all the existing works are either limited to compact domains or require
almost surely bounded noises. It is natural to ask whether the last iterate of
SGD can still guarantee the optimal convergence rate but without these two
restrictive assumptions. Beyond this important question, several other
theoretical problems remain open. For example, compared with the
last-iterate convergence of SGD for non-smooth problems, only a few results for
smooth optimization have been developed so far. Additionally, the existing results
are all limited to a non-composite objective and the standard Euclidean norm.
It remains unclear whether last-iterate convergence can be provably
extended to the wider settings of composite optimization and non-Euclidean norms. In this work,
to address the issues mentioned above, we revisit the last-iterate convergence
of stochastic gradient methods and provide the first unified way to prove the
convergence rates both in expectation and in high probability to accommodate
general domains, composite objectives, non-Euclidean norms, Lipschitz
conditions, smoothness, and (strong) convexity simultaneously. Additionally, we
extend our analysis to obtain the last-iterate convergence under heavy-tailed
noises.
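As a rough, self-contained illustration of the setting studied here (not code from the paper), the sketch below runs proximal SGD on a toy composite objective, an $\ell_1$-regularized least-squares problem, and returns the last iterate rather than an average. The step-size schedule and the toy data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy composite objective: f(x) = 0.5 * E||Ax - b||^2, accessed only through
# mini-batch stochastic gradients, plus the non-smooth part psi(x) = lam*||x||_1.
A = rng.standard_normal((500, 20))
x_true = np.zeros(20)
x_true[:5] = 1.0
b = A @ x_true + 0.1 * rng.standard_normal(500)
lam = 0.05

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1; handles the composite term exactly.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd_last_iterate(T=5000, batch=8):
    x = np.zeros(20)
    for t in range(1, T + 1):
        idx = rng.integers(0, A.shape[0], size=batch)
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch  # stochastic gradient of f
        eta = 0.1 / np.sqrt(t)                        # O(1/sqrt(t)) step size
        x = soft_threshold(x - eta * g, eta * lam)    # proximal (composite) step
    return x  # the LAST iterate is returned, not a running average

x_T = prox_sgd_last_iterate()
print("recovered support:", np.flatnonzero(np.abs(x_T) > 0.1))
```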
Related papers
- Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped first-order methods for stochastic optimization problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the stochastic gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z)
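A minimal sketch of the clipping mechanism such methods build on (the exact algorithms and clip-level schedules in the paper differ; the constant threshold and the toy oracle below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def clip(g, tau):
    # Rescale the stochastic gradient to norm tau when it is larger, so a
    # single heavy-tailed sample cannot produce an unbounded update.
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def clipped_sgd(grad_oracle, x0, T=10_000, eta0=0.5, tau=10.0):
    x = np.asarray(x0, dtype=float).copy()
    for t in range(1, T + 1):
        g = clip(grad_oracle(x), tau)
        x -= (eta0 / np.sqrt(t)) * g
    return x

def heavy_tailed_grad(x):
    # Gradient of 0.5*||x||^2 corrupted by Student-t noise with df < 2,
    # which has infinite variance (a stand-in for heavy-tailed noise).
    return x + rng.standard_t(df=1.5, size=x.shape)

print(clipped_sgd(heavy_tailed_grad, np.full(5, 10.0)))
```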
- Stochastic Nonsmooth Convex Optimization with Heavy-Tailed Noises: High-Probability Bound, In-Expectation Rate and Initial Distance Adaptation [22.758674468435302]
In a heavy-tailed noise regime, the difference between the stochastic gradient and the true gradient is assumed to have a finite $p$-th moment.
This paper provides a comprehensive analysis of nonsmooth convex optimization with heavy-tailed noises.
arXiv Detail & Related papers (2023-03-22T03:05:28Z)
- High Probability Convergence of Stochastic Gradient Methods [15.829413808059124]
We show convergence with bounds depending on the initial distance to the optimal solution.
We demonstrate that our techniques can be used to obtain high-probability bounds for AdaGrad-Norm.
arXiv Detail & Related papers (2023-02-28T18:42:11Z)
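AdaGrad-Norm keeps a single scalar accumulator of squared gradient norms, so the step size adapts without knowing problem parameters in advance. A minimal sketch under assumed defaults (the toy oracle is illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def adagrad_norm(grad_oracle, x0, T=5000, eta=1.0, b0=1e-2):
    x = np.asarray(x0, dtype=float).copy()
    b2 = b0 ** 2
    for _ in range(T):
        g = grad_oracle(x)
        b2 += float(g @ g)              # accumulate squared gradient norms
        x -= (eta / np.sqrt(b2)) * g    # step size shrinks automatically
    return x

noisy_quadratic = lambda x: x + 0.1 * rng.standard_normal(x.shape)
print(adagrad_norm(noisy_quadratic, np.full(10, 5.0)))
```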
- High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance [59.211456992422136]
We propose algorithms with high-probability convergence results under less restrictive assumptions.
These results justify the usage of the considered methods for solving problems that do not fit standard functional classes in optimization.
arXiv Detail & Related papers (2023-02-02T10:37:23Z)
- Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}}\, F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present an accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragradient steps with Nesterov's acceleration.
arXiv Detail & Related papers (2022-06-17T06:10:20Z)
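The sketch below implements only the plain extragradient component on a toy instance of this bilinearly-coupled problem; the paper's AG-EG method additionally incorporates acceleration. The choices of $F$, $G$, and the step size here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
B = rng.standard_normal((n, n))  # bilinear coupling H(x, y) = x^T B y

# Toy choices F(x) = 0.5*||x||^2 and G(y) = 0.5*||y||^2, so the unique
# saddle point of F(x) + H(x, y) - G(y) is (0, 0).
def grad_x(x, y):
    return x + B @ y

def grad_y(x, y):
    return B.T @ x - y

x, y = np.ones(n), np.ones(n)
eta = 0.1
for _ in range(2000):
    # Extragradient: a lookahead half-step, then the actual step taken
    # using gradients evaluated at the lookahead point.
    xh, yh = x - eta * grad_x(x, y), y + eta * grad_y(x, y)
    x, y = x - eta * grad_x(xh, yh), y + eta * grad_y(xh, yh)

print("distance to saddle point:", np.linalg.norm(x) + np.linalg.norm(y))
```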
- The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence [2.512827436728378]
Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization.
We show how to characterize the final-iterate convergence of SGD in the constant dimension setting.
arXiv Detail & Related papers (2021-06-28T11:51:04Z)
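As context for final-iterate analyses, a toy run contrasting SGD's last iterate with the classical uniform iterate average that older analyses bound (the quadratic problem and step sizes are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 50, 20_000

def noisy_grad(x):
    return x + rng.standard_normal(d)  # grad of 0.5*||x||^2 plus unit noise

x = np.full(d, 5.0)
avg = np.zeros(d)
for t in range(1, T + 1):
    x -= (1.0 / np.sqrt(t)) * noisy_grad(x)
    avg += (x - avg) / t               # running uniform average of iterates

print("last-iterate error:", np.linalg.norm(x))
print("averaged error    :", np.linalg.norm(avg))
```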
- High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points in high probability with the best-known rates for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z)
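A minimal sketch combining the three ingredients named above, clipping, momentum, and a normalized step; the schedules and constants are illustrative assumptions rather than the paper's tuned choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def clipped_normalized_momentum_sgd(grad_oracle, x0, T=5000,
                                    eta=0.05, beta=0.9, tau=10.0):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(T):
        g = grad_oracle(x)
        norm_g = np.linalg.norm(g)
        if norm_g > tau:                 # clip heavy-tailed gradient estimates
            g = g * (tau / norm_g)
        m = beta * m + (1 - beta) * g    # momentum averages out the noise
        x -= eta * m / (np.linalg.norm(m) + 1e-12)  # normalized descent step
    return x

def heavy_tailed_grad(x):
    # Non-convex toy loss sum(x^4 / 4) with Student-t noise (df < 2).
    return x ** 3 + rng.standard_t(df=1.5, size=x.shape)

print(clipped_normalized_momentum_sgd(heavy_tailed_grad, np.full(5, 2.0)))
```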
- Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z)
- Last iterate convergence of SGD for Least-Squares in the Interpolation regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimal predictor fits the inputs and outputs perfectly: $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z)
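A minimal sketch of the interpolation setting: labels are generated exactly by $\langle \theta_*, \phi(X) \rangle$ with no noise, and single-sample SGD with a constant step size drives the training error to zero. The finite-dimensional feature map and step-size choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 50
Phi = rng.standard_normal((n, d))       # rows play the role of phi(X_i)
theta_star = rng.standard_normal(d)
Y = Phi @ theta_star                    # noiseless: <theta_*, phi(X_i)> = Y_i

theta = np.zeros(d)
eta = 1.0 / np.max(np.sum(Phi ** 2, axis=1))  # constant step, safe for every row
for _ in range(100_000):
    i = rng.integers(n)
    residual = Phi[i] @ theta - Y[i]
    theta -= eta * residual * Phi[i]    # single-sample least-squares SGD step

print("training error:", np.linalg.norm(Phi @ theta - Y))
```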
- ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable recursive averaging of past stochastic gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z)
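A sketch of the recursive averaging idea: each iteration re-evaluates the same fresh sample at both the current and previous iterates and folds the difference into the running gradient estimate with a $1/t$ weight. The oracle interface and the additive-noise toy model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
noise_cache = {}

def noisy_grad(x, sample):
    # Gradient of 0.5*||x||^2 with noise fixed per sample index, so the same
    # sample can be evaluated at two points, as the update below requires.
    if sample not in noise_cache:
        noise_cache[sample] = rng.standard_normal(x.shape)
    return x + noise_cache[sample]

def root_sgd(grad_oracle, x0, T=5000, eta=0.1):
    x_prev = np.asarray(x0, dtype=float).copy()
    v = grad_oracle(x_prev, sample=1)        # plain stochastic gradient at t = 1
    x = x_prev - eta * v
    for t in range(2, T + 1):
        g_new = grad_oracle(x, sample=t)
        g_old = grad_oracle(x_prev, sample=t)      # same sample, previous point
        v = g_new + (1.0 - 1.0 / t) * (v - g_old)  # recursive one-over-t average
        x_prev, x = x, x - eta * v
    return x

print(root_sgd(noisy_grad, np.full(5, 3.0)))
```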
- A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence for both Adam and Adagrad, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
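A minimal reference implementation of the standard Adam update with the default parameters the quoted rate concerns (the toy oracle is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)

def adam(grad_oracle, x0, T=5000, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # first-moment (mean) estimate
    v = np.zeros_like(x)   # second-moment estimate
    for t in range(1, T + 1):
        g = grad_oracle(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

noisy_quadratic = lambda x: x + 0.1 * rng.standard_normal(x.shape)
print(adam(noisy_quadratic, np.full(10, 1.0)))
```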