Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization
- URL: http://arxiv.org/abs/2505.08306v1
- Date: Tue, 13 May 2025 07:32:48 GMT
- Title: Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization
- Authors: Shira Vansover-Hager, Tomer Koren, Roi Livni
- Abstract summary: We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. We show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting.
- Score: 34.451177321785146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase-transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \smash{\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
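As a purely illustrative aside (the toy objective, its parameters, and the Monte Carlo population-loss estimate below are assumptions made for this sketch, not the paper's lower-bound construction, which relies on an adversarial high-dimensional non-smooth instance), the following Python snippet shows the quantities the abstract refers to: without-replacement multi-pass SGD with step size $\eta = \Theta(1/\sqrt{n})$, the averaged iterate, and the gap between empirical and population loss after one versus several passes.

```python
# Toy sketch (illustrative assumption, NOT the paper's construction):
# multi-pass, without-replacement SGD with step size eta = 1/sqrt(n) on the
# simple non-smooth stochastic convex objective f(w; z) = |w - z|, z ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
n = 1000                       # sample size
eta = 1.0 / np.sqrt(n)         # step size Theta(1/sqrt(n)), optimal for one pass

sample = rng.standard_normal(n)

def empirical_loss(w, data):
    # Average loss on the training sample.
    return np.mean(np.abs(w - data))

def population_loss(w, num_mc=200_000):
    # Monte Carlo estimate of the population loss E_z |w - z|, z ~ N(0, 1).
    z = rng.standard_normal(num_mc)
    return np.mean(np.abs(w - z))

def multipass_sgd(data, eta, epochs, w0=1.0):
    # One shuffled pass over the data per epoch; returns the averaged iterate.
    w, iterates = w0, []
    for _ in range(epochs):
        for z in rng.permutation(data):
            w -= eta * np.sign(w - z)   # subgradient step on |w - z|
            iterates.append(w)
    return np.mean(iterates)

for epochs in (1, 2, 5):
    w_bar = multipass_sgd(sample, eta, epochs)
    print(f"epochs={epochs}: "
          f"empirical={empirical_loss(w_bar, sample):.4f}, "
          f"population={population_loss(w_bar):.4f}")
```

On this one-dimensional example both losses remain small regardless of the number of passes; the sketch only demonstrates the mechanics of the algorithm, since the $\Omega(1)$ overfitting described in the abstract arises only in carefully constructed high-dimensional instances.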
Related papers
- Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime [26.711510824243803]
We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at the optimum is zero or near zero. For a well-tuned stepsize we obtain a near-optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for the last iterate.
arXiv Detail & Related papers (2025-07-15T12:52:47Z) - From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate upper bounds in the realizable least squares setup. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z) - Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\widetilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\widetilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
arXiv Detail & Related papers (2024-02-24T23:10:28Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex
Optimization [9.019243171993553]
We study how the number of training steps $T$ and the step size $\eta$ affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show better rates can be attained but the improvement is reduced when training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to the noise.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z) - The Convergence Rate of SGD's Final Iterate: Analysis on Dimension
Dependence [2.512827436728378]
Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization.
We show how to characterize the final-iterate convergence of SGD in the constant dimension setting.
arXiv Detail & Related papers (2021-06-28T11:51:04Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ under a broad class of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Almost sure convergence rates for Stochastic Gradient Descent and
Stochastic Heavy Ball [17.33867778750777]
We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB) for the general stochastic approximation problem.
For SGD, in the convex and smooth setting, we provide the first almost sure convergence rates for a weighted average of the iterates.
arXiv Detail & Related papers (2020-06-14T11:12:05Z) - On Tight Convergence Rates of Without-replacement SGD [52.99237600875452]
For solving finite-sum optimization problems, SGD with without-replacement sampling is empirically shown to outperform SGD with with-replacement sampling.
In this work, we overcome these limitations by analyzing step sizes that vary across epochs.
arXiv Detail & Related papers (2020-04-18T16:14:11Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We provide a simple proof of convergence covering both the Adam and Adagrad algorithms, with a rate of $O(d \ln(N)/\sqrt{N})$.
Adam converges with the same $O(d \ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)