The Benefits of Implicit Regularization from SGD in Least Squares
Problems
- URL: http://arxiv.org/abs/2108.04552v1
- Date: Tue, 10 Aug 2021 09:56:47 GMT
- Title: The Benefits of Implicit Regularization from SGD in Least Squares
Problems
- Authors: Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and
Dean P. Foster and Sham M. Kakade
- Abstract summary: Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We make comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
- Score: 116.85246178212616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) exhibits strong algorithmic regularization
effects in practice, which has been hypothesized to play an important role in
the generalization of modern machine learning approaches. In this work, we seek
to understand these issues in the simpler setting of linear regression
(including both underparameterized and overparameterized regimes), where our
goal is to make sharp instance-based comparisons of the implicit regularization
afforded by (unregularized) average SGD with the explicit regularization of
ridge regression. For a broad class of least squares problem instances (that
are natural in high-dimensional settings), we show: (1) for every problem
instance and for every ridge parameter, (unregularized) SGD, when provided with
logarithmically more samples than that provided to the ridge algorithm,
generalizes no worse than the ridge solution (provided SGD uses a tuned
constant stepsize); (2) conversely, there exist instances (in this wide problem
class) where optimally-tuned ridge regression requires quadratically more
samples than SGD in order to have the same generalization performance. Taken
together, our results show that, up to the logarithmic factors, the
generalization performance of SGD is always no worse than that of ridge
regression in a wide range of overparameterized problems, and, in fact, could
be much better for some problem instances. More generally, our results show how
algorithmic regularization has important consequences even in simpler
(overparameterized) convex settings.
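To make the objects being compared concrete, here is a minimal NumPy sketch (not the authors' code) of the two procedures the abstract contrasts: one pass of constant-stepsize SGD with tail averaging, and ridge regression with a tuned regularization parameter, both run on a synthetic overparameterized least-squares instance with a power-law covariance spectrum. The dimension, sample sizes, noise level, stepsize grid, and spectral decay are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overparameterized least-squares instance (illustrative choices).
d, n_sgd, n_ridge = 500, 400, 400
eigs = 1.0 / np.arange(1, d + 1) ** 2          # power-law covariance spectrum
w_star = rng.normal(size=d) * np.sqrt(eigs)    # ground-truth weights
noise_std = 0.5

def sample(n):
    X = rng.normal(size=(n, d)) * np.sqrt(eigs)   # features with covariance diag(eigs)
    y = X @ w_star + noise_std * rng.normal(size=n)
    return X, y

def excess_risk(w):
    # E[(x^T w - x^T w_*)^2] under the diagonal covariance.
    return np.sum(eigs * (w - w_star) ** 2)

def averaged_sgd(X, y, stepsize):
    # One pass of constant-stepsize SGD, averaging the iterates of the last half (tail averaging).
    n = len(y)
    w, w_bar = np.zeros(d), np.zeros(d)
    for t in range(n):
        w -= stepsize * (X[t] @ w - y[t]) * X[t]   # stochastic gradient step
        if t >= n // 2:
            w_bar += w / (n - n // 2)
    return w_bar

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X_s, y_s = sample(n_sgd)
X_r, y_r = sample(n_ridge)

# Tune each method over a coarse grid (both get the benefit of oracle tuning here).
sgd_risk = min(excess_risk(averaged_sgd(X_s, y_s, eta)) for eta in [0.01, 0.05, 0.1, 0.5])
ridge_risk = min(excess_risk(ridge(X_r, y_r, lam)) for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0])

print(f"excess risk, tail-averaged constant-stepsize SGD: {sgd_risk:.4f}")
print(f"excess risk, tuned ridge regression:              {ridge_risk:.4f}")
```

For an instance of this type, the paper's first result says the tail-averaged SGD solution generalizes no worse than the tuned ridge solution once SGD is given logarithmically more samples, while the second result says ridge can require quadratically more samples than SGD on some instances; the sketch only fixes notation and does not reproduce the paper's analysis.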
Related papers
- Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems [19.995877680083105]
We study the generalization performance of stochastic gradient descent (SGD) with preconditioning for the least squares problem.
We show that the proposed preconditioning matrix is simple enough to be estimated robustly from finite samples.
arXiv Detail & Related papers (2024-03-13T14:42:06Z) - Risk Bounds of Accelerated SGD for Overparameterized Linear Regression [75.27846230182885]
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
arXiv Detail & Related papers (2023-11-23T23:02:10Z) - A Unified Approach to Controlling Implicit Regularization via Mirror
Descent [18.536453909759544]
Mirror descent (MD) is a notable generalization of gradient descent (GD).
We show that MD can be implemented efficiently and enjoys fast convergence under suitable conditions.
arXiv Detail & Related papers (2023-06-24T03:57:26Z) - Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although multi-pass SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization by obtaining a good fit to the training data.
We also analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Last Iterate Risk Bounds of SGD with Decaying Stepsize for
Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides a problem-dependent analysis of the last-iterate risk bounds of SGD with decaying stepsize; a minimal illustration of such a stepsize schedule appears after this list.
arXiv Detail & Related papers (2021-10-12T17:49:54Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
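As a companion to the sketch above, the following illustrates the decaying-stepsize, last-iterate variant referenced in the related paper on last-iterate risk bounds: the stepsize is held constant for the first part of the pass and then decayed geometrically in stages, and the final iterate (rather than a tail average) is returned. The schedule, stage length, and decay factor are assumptions chosen for illustration and are not claimed to match the exact schedule analyzed in that paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic least-squares instance (illustrative, mirroring the earlier sketch).
d, n = 200, 1000
eigs = 1.0 / np.arange(1, d + 1) ** 2                # power-law covariance spectrum
w_star = rng.normal(size=d) * np.sqrt(eigs)
X = rng.normal(size=(n, d)) * np.sqrt(eigs)
y = X @ w_star + 0.5 * rng.normal(size=n)

def last_iterate_sgd(X, y, eta0=0.1, decay_start=0.5, decay_factor=0.5, stage_len=100):
    """One pass of SGD: constant stepsize for the first `decay_start` fraction of
    the pass, then the stepsize is multiplied by `decay_factor` every `stage_len`
    steps; the last iterate (not an average) is returned."""
    n, d = X.shape
    w = np.zeros(d)
    eta = eta0
    start = int(decay_start * n)
    for t in range(n):
        if t >= start and (t - start) % stage_len == 0:
            eta *= decay_factor                       # enter the next decay stage
        w -= eta * (X[t] @ w - y[t]) * X[t]           # stochastic gradient step
    return w

w_last = last_iterate_sgd(X, y)
print(f"excess risk of the last iterate: {np.sum(eigs * (w_last - w_star) ** 2):.4f}")
```

Roughly speaking, shrinking the stepsize late in the pass plays a variance-reducing role similar to that of iterate averaging in the first sketch.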