Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime
- URL: http://arxiv.org/abs/2507.11274v2
- Date: Mon, 28 Jul 2025 11:03:24 GMT
- Title: Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime
- Authors: Amit Attia, Matan Schliserman, Uri Sherman, Tomer Koren
- Abstract summary: We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. For a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for the last iterate.
- Score: 26.711510824243803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. The behavior of the last iterate of SGD in this setting -- particularly with large (constant) stepsizes -- has received growing attention in recent years due to implications for the training of over-parameterized models, as well as to analyzing forgetting in continual learning and to understanding the convergence of the randomized Kaczmarz method for solving linear systems. We establish that after $T$ steps of SGD on $\beta$-smooth convex loss functions with stepsize $0 < \eta < 2/\beta$, the last iterate exhibits expected excess risk $\widetilde{O}(\frac{1}{\eta (2-\beta \eta) T^{1-\beta\eta/2}} + \frac{\eta}{(2-\beta\eta)^2} T^{\beta\eta/2} \sigma_\star^2)$, where $\sigma_\star^2$ denotes the variance of the stochastic gradients at the optimum. In particular, for a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for the last iterate, extending the results of Varre et al. (2021) beyond least squares regression; and when $\sigma_\star=0$ we obtain a rate of $\smash{O(1/\sqrt T)}$ with $\eta=1/\beta$, improving upon the best-known $\smash{O(T^{-1/4})}$ rate recently established by Evron et al. (2025) in the special case of realizable linear regression.
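To make the $\sigma_\star = 0$ case concrete: substituting $\eta = 1/\beta$ into the first term of the bound gives $\frac{1}{\eta(2-\beta\eta)T^{1-\beta\eta/2}} = \beta/\sqrt{T}$, which is the $O(1/\sqrt{T})$ rate quoted in the abstract. The following is a minimal sketch of that setting, not the authors' experiment. It assumes a synthetic realizable least-squares problem with unit-norm rows (so each per-example loss is 1-smooth and $\beta = 1$) and treats the uniform distribution over the rows as the population. The function names, problem sizes, and seeds are hypothetical choices made for illustration.

```python
import numpy as np


def sgd_last_iterate(X, y, eta, T, rng):
    """Run T single-sample SGD steps on f(w; i) = 0.5 * (<x_i, w> - y_i)^2
    from w = 0 and return the last iterate (no averaging)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)               # sample one example uniformly
        residual = X[i] @ w - y[i]
        w = w - eta * residual * X[i]     # stochastic gradient step
    return w


def excess_risk(X, y, w):
    """Average loss over the data; the minimum is 0 because the data are realizable."""
    return 0.5 * np.mean((X @ w - y) ** 2)


rng = np.random.default_rng(0)
n, d = 200, 20                            # hypothetical problem size
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows, so each loss is 1-smooth
w_star = rng.standard_normal(d)
y = X @ w_star                            # noiseless labels: sigma_star = 0

beta = 1.0
eta = 1.0 / beta                          # the stepsize singled out in the abstract

initial_risk = excess_risk(X, y, np.zeros(d))
risks = {}
for T in (10, 100, 1000):
    w_T = sgd_last_iterate(X, y, eta, T, np.random.default_rng(T))
    risks[T] = excess_risk(X, y, w_T)
    print(f"T={T:5d}  excess risk of last iterate = {risks[T]:.3e}")
```

With unit-norm rows and $\eta = 1$, each step coincides with a randomized Kaczmarz projection, which is the connection to linear systems mentioned in the abstract. On a well-conditioned instance like this one the last iterate typically decays far faster than the worst-case $O(1/\sqrt{T})$ guarantee, so the sketch illustrates the setting and does not test the tightness of the bound.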
Related papers
- Constant Stepsize Local GD for Logistic Regression: Acceleration by Instability [13.332982107151434]
We analyze Local Gradient Descent for logistic regression with separable, heterogeneous data using any stepsize $\eta > 0$. Our analysis parallels the single-machine analysis of \cite{wu2024large}, in which instability is caused by extremely large stepsizes.
arXiv Detail & Related papers (2025-06-16T20:29:00Z)
- Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization [34.451177321785146]
We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. We show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting.
arXiv Detail & Related papers (2025-05-13T07:32:48Z)
- From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate upper bounds in the realizable least squares setup. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z)
- Variable Selection in Convex Piecewise Linear Regression [5.366354612549172]
This paper presents Sparse Gradient Descent (SpGD) as a solution for variable selection in convex piecewise linear regression.
A non-asymptotic local convergence analysis is provided for SpGD under sub-Gaussian noise.
arXiv Detail & Related papers (2024-11-04T16:19:09Z)
- Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation [22.652143194356864]
We address the problem of solving strongly convex and smooth problems using stochastic gradient descent (SGD) with a constant step size. We provide an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations $n$. Our analysis relies on the properties of the SGD iterates viewed as a time-homogeneous Markov chain.
arXiv Detail & Related papers (2024-10-07T15:02:48Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with a stochastic environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z)
- Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}} F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present an accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragrad
arXiv Detail & Related papers (2022-06-17T06:10:20Z)
- Optimal and instance-dependent guarantees for Markovian linear stochastic approximation [47.912511426974376]
We show a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme.
We derive corollaries of these results for policy evaluation with Markov noise.
arXiv Detail & Related papers (2021-12-23T18:47:50Z)
- Last iterate convergence of SGD for Least-Squares in the Interpolation regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits perfectly inputs and outputs $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Almost sure convergence rates for Stochastic Gradient Descent and Stochastic Heavy Ball [17.33867778750777]
We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB) for the general stochastic approximation problem.
For SGD, in the convex and smooth setting, we provide the first almost sure convergence rates for a weighted average of the iterates.
arXiv Detail & Related papers (2020-06-14T11:12:05Z)
- A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We show a proof of convergence covering both the Adam and Adagrad algorithms.
With the right hyperparameters, Adam converges at the same rate of $O(d\ln(N)/\sqrt{N})$; when used with the default parameters, it does not converge.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)