Benign Overfitting of Constant-Stepsize SGD for Linear Regression
- URL: http://arxiv.org/abs/2103.12692v1
- Date: Tue, 23 Mar 2021 17:15:53 GMT
- Title: Benign Overfitting of Constant-Stepsize SGD for Linear Regression
- Authors: Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and
Sham M. Kakade
- Abstract summary: Algorithmic inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
- Score: 122.70478935214128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an increasing realization that algorithmic inductive biases are
central in preventing overfitting; empirically, we often see a benign
overfitting phenomenon in overparameterized settings for natural learning
algorithms, such as stochastic gradient descent (SGD), where little to no
explicit regularization has been employed. This work considers this issue in
arguably the most basic setting: constant-stepsize SGD (with iterate averaging)
for linear regression in the overparameterized regime. Our main result provides
a sharp excess risk bound, stated in terms of the full eigenspectrum of the
data covariance matrix, that reveals a bias-variance decomposition
characterizing when generalization is possible: (i) the variance bound is
characterized in terms of an effective dimension (specific for SGD) and (ii)
the bias bound provides a sharp geometric characterization in terms of the
location of the initial iterate (and how it aligns with the data covariance
matrix). We reflect on a number of notable differences between the algorithmic
regularization afforded by (unregularized) SGD in comparison to ordinary least
squares (minimum-norm interpolation) and ridge regression.
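As a rough illustration of the setting described above (not the paper's analysis or exact experiments), the sketch below runs constant-stepsize SGD with iterate averaging on a synthetic overparameterized linear regression problem and compares it to the minimum-norm (OLS) interpolant and ridge regression. The problem sizes, power-law covariance spectrum, step size, and ridge penalty are illustrative assumptions.
```python
# Minimal sketch, not the paper's exact analysis or experiments: constant-stepsize
# SGD with iterate averaging on a synthetic overparameterized linear regression
# problem, compared against the minimum-norm (OLS) interpolant and ridge regression.
# The sizes, power-law spectrum, step size, and ridge penalty below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 500                        # overparameterized: d > n
eigs = 1.0 / np.arange(1, d + 1) ** 2  # assumed data covariance eigenspectrum
X = rng.standard_normal((n, d)) * np.sqrt(eigs)   # rows have covariance diag(eigs)
w_star = rng.standard_normal(d) * np.sqrt(eigs)
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Constant-stepsize SGD (single pass, one sample per step) with iterate averaging.
gamma = 0.1                            # constant step size (assumed)
w = np.zeros(d)                        # initial iterate
w_avg = np.zeros(d)
for t in range(n):
    x_t, y_t = X[t], y[t]
    w -= gamma * (x_t @ w - y_t) * x_t      # stochastic gradient step
    w_avg += (w - w_avg) / (t + 1)          # running average of the iterates

# Minimum-norm interpolation (OLS in the overparameterized regime).
w_ols = np.linalg.pinv(X) @ y

# Ridge regression with an illustrative penalty.
lam = 1e-2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def excess_risk(w_hat):
    """Population excess risk under the assumed diagonal covariance."""
    return float(np.sum(eigs * (w_hat - w_star) ** 2))

for name, w_hat in [("averaged SGD", w_avg), ("min-norm OLS", w_ols), ("ridge", w_ridge)]:
    print(f"{name:12s} excess risk: {excess_risk(w_hat):.4f}")
```
In this toy setup one can vary the spectrum decay, noise level, and step size to see how the averaged SGD iterate, the interpolant, and the ridge solution trade off bias and variance.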
Related papers
- Benign overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Inductive Bias [8.668428992331808]
We develop a Sobolev norm learning curve for kernel ridge(less) regression when addressing (elliptical) linear inverse problems.
Our results show that the PDE operators in the inverse problem can stabilize the variance and even induce benign overfitting in fixed-dimensional problems.
arXiv Detail & Related papers (2024-06-13T14:54:30Z)
- High-Dimensional Kernel Methods under Covariate Shift: Data-Dependent Implicit Regularization [83.06112052443233]
This paper studies kernel ridge regression in high dimensions under covariate shifts.
By a bias-variance decomposition, we theoretically demonstrate that the re-weighting strategy reduces the variance.
For the bias, we analyze regularization at arbitrary or well-chosen scales, showing that the bias can behave very differently depending on the regularization scale.
arXiv Detail & Related papers (2024-06-05T12:03:27Z)
- Risk Bounds of Accelerated SGD for Overparameterized Linear Regression [75.27846230182885]
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
arXiv Detail & Related papers (2023-11-23T23:02:10Z)
- A Unified Approach to Controlling Implicit Regularization via Mirror Descent [18.536453909759544]
Mirror descent (MD) is a notable generalization of gradient descent (GD).
We show that MD can be implemented efficiently and enjoys fast convergence under suitable conditions (an illustrative mirror descent sketch appears after this list).
arXiv Detail & Related papers (2023-06-24T03:57:26Z)
- Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) can be understood as a "conventional" learning rule that achieves generalization by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z)
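Referenced from the mirror descent entry above: a minimal illustrative sketch (not the cited paper's method) of mirror descent on an underdetermined least-squares problem with a squared q-norm mirror map, whose geometry shapes the implicit regularization of the solution. The problem sizes, exponent q, step size, and iteration count are assumptions.
```python
# Illustrative sketch only (not the cited paper's algorithm): mirror descent on an
# underdetermined least-squares problem with the mirror map psi(w) = 0.5 * ||w||_q^2.
# The mirror map's geometry biases the solution toward small q-norm.
# Sizes, q, eta, and the iteration count are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                        # underdetermined: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

q = 1.5                              # exponent of the mirror map (assumed)
p = q / (q - 1)                      # dual exponent, 1/p + 1/q = 1
eta = 0.05                           # step size (assumed)
theta = np.zeros(d)                  # dual variable, theta = grad psi(w)

def mirror_back(theta):
    """Inverse mirror map: gradient of the conjugate 0.5 * ||theta||_p^2."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) * norm ** (2 - p)

w = mirror_back(theta)
for _ in range(50000):
    grad = X.T @ (X @ w - y) / n     # least-squares gradient
    theta -= eta * grad              # gradient step in the dual (mirror) space
    w = mirror_back(theta)           # map back to the primal space

print("training residual norm:", np.linalg.norm(X @ w - y))
print("q-norm of the solution:", np.linalg.norm(w, ord=q))
```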