Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime
- URL: http://arxiv.org/abs/2203.03159v1
- Date: Mon, 7 Mar 2022 06:34:53 GMT
- Title: Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime
- Authors: Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and
Sham M. Kakade
- Abstract summary: Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations.
- Score: 127.21287240963859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) has achieved great success due to its
superior performance in both optimization and generalization. Most existing
generalization analyses are for single-pass SGD, which is a less practical
variant than the commonly used multi-pass SGD. Moreover, theoretical
analyses of multi-pass SGD often concern a worst-case instance in a class of
problems, which may be too pessimistic to explain the superior generalization
ability on a particular problem instance. The goal of this paper is to
sharply characterize the generalization of multi-pass SGD, by developing an
instance-dependent excess risk bound for least squares in the interpolation
regime, which is expressed as a function of the iteration number, stepsize, and
data covariance. We show that the excess risk of SGD can be exactly decomposed
into the excess risk of GD and a positive fluctuation error, suggesting that
SGD always performs instance-wise worse than GD in terms of generalization. On
the other hand, we show that although SGD needs more iterations than GD to
achieve the same level of excess risk, it requires fewer stochastic gradient
evaluations and is therefore preferable in terms of computational time.
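To make the setting concrete, below is a minimal sketch (not the paper's code) of the comparison the abstract describes: multi-pass SGD versus full-batch GD on a noiseless, overparameterized least-squares problem, with the excess risk measured through the data covariance. The problem sizes, stepsize, and number of passes are arbitrary choices for illustration.

```python
# Illustrative sketch only: multi-pass SGD vs. full-batch GD on a noiseless
# overparameterized least-squares problem (n < d, interpolation regime).
# All constants (n, d, eta, passes) are arbitrary choices for this demo.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                # n < d: interpolation regime
X = rng.standard_normal((n, d)) / np.sqrt(d)  # rows x_i with E[x x^T] = I/d
w_star = rng.standard_normal(d)
y = X @ w_star                                # noiseless labels: w* interpolates

Sigma = np.eye(d) / d                         # population covariance of x

def excess_risk(w):
    diff = w - w_star
    return float(diff @ Sigma @ diff)         # (w - w*)^T Sigma (w - w*)

eta, passes = 1.0, 1000                       # same constant stepsize for both

# Full-batch GD on the empirical risk (1/2n)||Xw - y||^2:
# one iteration costs n stochastic gradient evaluations.
w_gd = np.zeros(d)
for _ in range(passes):
    w_gd -= eta * X.T @ (X @ w_gd - y) / n

# Multi-pass SGD (sampling with replacement): one iteration costs one evaluation.
w_sgd = np.zeros(d)
for _ in range(passes * n):
    i = rng.integers(n)
    w_sgd -= eta * (X[i] @ w_sgd - y[i]) * X[i]

print(f"GD  excess risk after {passes * n} gradient evaluations: {excess_risk(w_gd):.4e}")
print(f"SGD excess risk after {passes * n} gradient evaluations: {excess_risk(w_sgd):.4e}")
```

At this matched budget of gradient evaluations, SGD performs n times more iterations than GD, which is the trade-off the abstract refers to: iteration for iteration, SGD is never better than GD instance-wise, but per stochastic gradient evaluation it can reach a comparable excess risk sooner.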
Related papers
- Risk Bounds of Accelerated SGD for Overparameterized Linear Regression [75.27846230182885]
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
arXiv Detail & Related papers (2023-11-23T23:02:10Z) - Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Last Iterate Risk Bounds of SGD with Decaying Stepsize for
Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides a problem-dependent analysis of the last-iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression (see the sketch after this list).
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and that of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z)
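As a companion to the implicit-regularization comparison in "The Benefits of Implicit Regularization from SGD in Least Squares Problems" above, here is a minimal sketch (an assumed setup, not code from any of the listed papers) that places the iterate average of constant-stepsize SGD next to an explicit ridge solution on synthetic linear regression. The stepsize `eta` and ridge level `lam` below are ad-hoc choices for the demo, not the pairing analyzed in that paper.

```python
# Illustrative sketch only: iterate-averaged constant-stepsize SGD (implicit
# regularization) vs. ridge regression (explicit regularization) on synthetic
# data. The constants n, d, eta, and lam are arbitrary choices for this demo.
import numpy as np

rng = np.random.default_rng(1)
n, d, noise = 500, 20, 0.5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + noise * rng.standard_normal(n)

# Single-pass, constant-stepsize SGD with iterate averaging.
eta = 0.5 / d
w = np.zeros(d)
w_avg = np.zeros(d)
for t in range(n):
    w -= eta * (X[t] @ w - y[t]) * X[t]
    w_avg += (w - w_avg) / (t + 1)    # running average of the SGD iterates

# Ridge regression at a hand-picked regularization level.
lam = 1.0 / (eta * n)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("||w_avg   - w*|| =", np.linalg.norm(w_avg - w_star))
print("||w_ridge - w*|| =", np.linalg.norm(w_ridge - w_star))
```

With these parameter choices both estimators land close to w*; the sketch only shows the two objects the comparison is about, not a reproduction of any paper's results.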