Benign Underfitting of Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2202.13361v2
- Date: Tue, 1 Mar 2022 07:18:12 GMT
- Title: Benign Underfitting of Stochastic Gradient Descent
- Authors: Tomer Koren, Roi Livni, Yishay Mansour, Uri Sherman
- Abstract summary: We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We then analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
- Score: 72.38051710389732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study to what extent stochastic gradient descent (SGD) may be understood
as a "conventional" learning rule that achieves generalization performance by
obtaining a good fit to training data. We consider the fundamental stochastic
convex optimization framework, where (one pass, without-replacement) SGD is
classically known to minimize the population risk at rate $O(1/\sqrt n)$, and
prove that, surprisingly, there exist problem instances where the SGD solution
exhibits both empirical risk and generalization gap of $\Omega(1)$.
Consequently, it turns out that SGD is not algorithmically stable in any sense,
and its generalization ability cannot be explained by uniform convergence or
any other currently known generalization bound technique for that matter (other
than that of its classical analysis). We then continue to analyze the closely
related with-replacement SGD, for which we show that an analogous phenomenon
does not occur and prove that its population risk does in fact converge at the
optimal rate. Finally, we interpret our main results in the context of
without-replacement SGD for finite-sum convex optimization problems, and derive
upper and lower bounds for the multi-epoch regime that significantly improve
upon previously known results.
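To make concrete the two sampling schemes compared in the abstract, here is a minimal illustrative sketch (not code from the paper): one-pass without-replacement SGD visits each of the n training samples exactly once, while with-replacement SGD draws each step's sample uniformly at random from the training set. The squared loss, synthetic data, step size, and function names below are placeholder choices made for this illustration.

```python
import numpy as np

def sgd_one_pass_without_replacement(X, y, lr=0.1, seed=0):
    """One-pass SGD: process a random permutation of the n samples once.
    Illustrative per-sample loss f(w; x, y) = 0.5 * (w @ x - y)**2; the paper's
    setting is a general stochastic convex problem, not specifically least squares."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    iterates = [w.copy()]
    for i in rng.permutation(n):            # each sample is used exactly once
        grad = (w @ X[i] - y[i]) * X[i]     # per-sample gradient
        w = w - lr * grad
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)        # averaged iterate, as in the classical O(1/sqrt(n)) analysis

def sgd_with_replacement(X, y, steps=None, lr=0.1, seed=0):
    """With-replacement SGD: each step samples an index uniformly at random,
    so the same sample may be revisited and others never seen."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    steps = n if steps is None else steps
    w = np.zeros(d)
    iterates = [w.copy()]
    for _ in range(steps):
        i = rng.integers(n)
        grad = (w @ X[i] - y[i]) * X[i]
        w = w - lr * grad
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)

# Toy usage on synthetic linear data (placeholder problem).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
w_wo = sgd_one_pass_without_replacement(X, y)
w_wr = sgd_with_replacement(X, y)
print("empirical risk (without-replacement):", 0.5 * np.mean((X @ w_wo - y) ** 2))
print("empirical risk (with-replacement):   ", 0.5 * np.mean((X @ w_wr - y) ** 2))
```

The paper's $\Omega(1)$ empirical-risk and generalization-gap lower bounds rely on a carefully constructed high-dimensional convex instance; the toy problem above only illustrates how the two variants consume the sample, not that construction.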
Related papers
- Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems [19.995877680083105]
We study the generalization performance of stochastic gradient descent (SGD) with preconditioning for the least squares problem.
We show that our proposed preconditioning matrix is straightforward enough to allow robust estimation from finite samples.
arXiv Detail & Related papers (2024-03-13T14:42:06Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds [12.699376765058137]
Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
The overarching goal of this paper is to provide general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z) - Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer gradient evaluations in total.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression (a minimal sketch of this comparison appears after this list).
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
We provide a general sample complexity analysis of the "vanilla" PG method.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD as compared to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z)
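Two of the entries above compare the implicit regularization of (unregularized) averaged SGD for least squares with the explicit regularization of ridge regression. The sketch below is a minimal, hypothetical illustration of that comparison; it is not taken from any of the listed papers, and the step size, ridge parameter, and synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def averaged_sgd_least_squares(X, y, lr=0.05, seed=0):
    """Single pass of constant-stepsize SGD on 0.5 * (w @ x - y)**2,
    returning the running average of the iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for t, i in enumerate(rng.permutation(n), start=1):
        w = w - lr * (w @ X[i] - y[i]) * X[i]
        avg += (w - avg) / t                 # running mean of iterates
    return avg

def ridge(X, y, lam=1.0):
    """Explicitly regularized least squares:
    argmin_w 0.5 * ||X w - y||^2 + 0.5 * lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Placeholder synthetic data.
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(500)
w_sgd = averaged_sgd_least_squares(X, y)
w_ridge = ridge(X, y, lam=1.0)
print("||w_sgd - w_ridge|| =", np.linalg.norm(w_sgd - w_ridge))
```

How closely the two solutions agree, and for which choices of step size and ridge parameter, is the kind of question those papers study; the snippet only sets up the two objects being compared.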