SGD Generalizes Better Than GD (And Regularization Doesn't Help)
- URL: http://arxiv.org/abs/2102.01117v1
- Date: Mon, 1 Feb 2021 19:18:40 GMT
- Title: SGD Generalizes Better Than GD (And Regularization Doesn't Help)
- Authors: Idan Amir, Tomer Koren, Roi Livni
- Abstract summary: We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
- Score: 39.588906680621825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We give a new separation result between the generalization performance of
stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in
the fundamental stochastic convex optimization model. While for SGD it is
well-known that $O(1/\epsilon^2)$ iterations suffice for obtaining a solution
with $\epsilon$ excess expected risk, we show that with the same number of
steps GD may overfit and emit a solution with $\Omega(1)$ generalization error.
Moreover, we show that in fact $\Omega(1/\epsilon^4)$ iterations are necessary
for GD to match the generalization performance of SGD, which is also tight due
to recent work by Bassily et al. (2020). We further discuss how regularizing
the empirical risk minimized by GD essentially does not change the above
result, and revisit the concepts of stability, implicit bias and the role of
the learning algorithm in generalization.
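To make the two procedures being compared concrete, here is a minimal illustrative sketch (not from the paper): one-pass SGD versus full-batch GD run for the same number of iterations on a toy least-squares problem. The paper's $\Omega(1)$ overfitting result is a worst-case separation built on a carefully constructed instance, so a benign toy like this will not reproduce the gap; the sketch only shows how the iteration budgets line up, and that GD pays $n$ gradient evaluations per step where one-pass SGD pays one.

```python
# Illustrative sketch only: one-pass SGD vs. full-batch GD on a toy convex
# problem (least squares). Problem sizes and step size are arbitrary choices,
# and this toy will NOT reproduce the paper's worst-case Omega(1) gap.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20                       # samples, dimension (toy sizes)
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

def grad(w, Xb, yb):
    """Gradient of the averaged squared loss 0.5 * ||Xb w - yb||^2 / len(yb)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

eta = 0.05                           # step size (not tuned)

# One-pass SGD: n iterations, one fresh sample per step.
w_sgd = np.zeros(d)
for i in range(n):
    w_sgd -= eta * grad(w_sgd, X[i:i + 1], y[i:i + 1])

# Full-batch GD: the same number of iterations, each using all n samples.
w_gd = np.zeros(d)
for _ in range(n):
    w_gd -= eta * grad(w_gd, X, y)

# Compare empirical risk with risk on fresh ("population") data.
X_te = rng.normal(size=(5 * n, d))
y_te = X_te @ w_star + 0.1 * rng.normal(size=5 * n)
for name, w in [("SGD", w_sgd), ("GD", w_gd)]:
    tr = 0.5 * np.mean((X @ w - y) ** 2)
    te = 0.5 * np.mean((X_te @ w - y_te) ** 2)
    print(f"{name}: train risk {tr:.4f}, test risk {te:.4f}")
```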
Related papers
- The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization [30.26365073195728]
We study the generalization performance of gradient methods in the fundamental stochastic convex optimization setting.
We show that an application of the same construction technique provides a similar $\Omega(\sqrt{d})$ lower bound for the sample complexity of SGD to reach a non-trivial empirical error.
arXiv Detail & Related papers (2024-01-22T15:50:32Z) - On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD).
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks.
arXiv Detail & Related papers (2023-12-26T18:06:48Z) - Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although multi-pass SGD needs more iterations than GD to achieve the same level of excess risk, it saves on the number of gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [30.41773138781369]
We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single pass SGD in the worst case.
For certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD.
arXiv Detail & Related papers (2021-07-11T15:50:01Z) - Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
arXiv Detail & Related papers (2021-06-11T17:59:07Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)