Risk Bounds of Accelerated SGD for Overparameterized Linear Regression
- URL: http://arxiv.org/abs/2311.14222v1
- Date: Thu, 23 Nov 2023 23:02:10 GMT
- Title: Risk Bounds of Accelerated SGD for Overparameterized Linear Regression
- Authors: Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, Quanquan Gu
- Abstract summary: Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
- Score: 75.27846230182885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerated stochastic gradient descent (ASGD) is a workhorse in deep
learning and often achieves better generalization performance than SGD.
However, existing optimization theory can only explain the faster convergence
of ASGD, but cannot explain its better generalization. In this paper, we study
the generalization of ASGD for overparameterized linear regression, which is
possibly the simplest setting of learning with overparameterization. We
establish an instance-dependent excess risk bound for ASGD within each
eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD
outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate
of exponential decay for bias error, while in the subspace of large
eigenvalues, its bias error decays slower than SGD; and (ii) the variance error
of ASGD is always larger than that of SGD. Our result suggests that ASGD can
outperform SGD when the difference between the initialization and the true
weight vector is mostly confined to the subspace of small eigenvalues.
Additionally, when our analysis is specialized to linear regression in the
strongly convex setting, it yields a tighter bound for bias error than the
best-known result.
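The following is a minimal numerical sketch, not the authors' code: the specific ASGD parameterization analyzed in the paper is replaced here by plain heavy-ball momentum, and the spiked covariance spectrum, step sizes, and momentum value are arbitrary illustrative choices. It runs streaming SGD and momentum SGD on a synthetic overparameterized least-squares instance whose true weights lie in the small-eigenvalue subspace, and reports the excess-risk contribution of the large- and small-eigenvalue subspaces of the data covariance, i.e., the quantities the bound above is stated over.

```python
# Illustrative sketch only: not the paper's algorithm or code. Heavy-ball
# momentum stands in for ASGD, and all hyperparameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d, k, n_steps = 100, 10, 4000                   # ambient dim, "large" subspace dim, iterations
eigvals = np.concatenate([np.ones(k), 1e-3 * np.ones(d - k)])  # spiked spectrum of the covariance H
sqrt_eig = np.sqrt(eigvals)                     # work in the eigenbasis of H (H diagonal WLOG)

w_star = np.zeros(d)
w_star[k:] = 1.0 / np.sqrt(d - k)               # true weights confined to the small-eigenvalue subspace
noise_std = 0.1

def run(lr, momentum):
    """Streaming SGD with heavy-ball momentum; momentum=0 recovers plain SGD."""
    w, v = np.zeros(d), np.zeros(d)
    for _ in range(n_steps):
        x = sqrt_eig * rng.standard_normal(d)               # x ~ N(0, H)
        y = x @ w_star + noise_std * rng.standard_normal()  # noisy label
        grad = (x @ w - y) * x                              # gradient of 0.5 * (x @ w - y)**2
        v = momentum * v - lr * grad
        w = w + v
    diff = w - w_star
    # excess-risk contribution of each eigen-subspace: sum_i lambda_i * (w_i - w*_i)^2
    return np.sum(eigvals[:k] * diff[:k] ** 2), np.sum(eigvals[k:] * diff[k:] ** 2)

sgd_top, sgd_tail = run(lr=0.01, momentum=0.0)
asgd_top, asgd_tail = run(lr=0.002, momentum=0.9)
print(f"excess risk in large-eigenvalue subspace: SGD {sgd_top:.3e}  momentum SGD {asgd_top:.3e}")
print(f"excess risk in small-eigenvalue subspace: SGD {sgd_tail:.3e}  momentum SGD {asgd_tail:.3e}")
```

Tail-averaging the iterates, which is standard in ASGD analyses for least squares, would reduce the variance contribution; it is omitted here to keep the sketch short.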
Related papers
- Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - When Does Preconditioning Help or Hurt Generalization? [74.25170084614098]
We show how the implicit bias of first- and second-order methods affects the comparison of generalization properties.
We discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD.
arXiv Detail & Related papers (2020-06-18T17:57:26Z)
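As a companion to the last entry, the sketch below shows one simple way to interpolate between GD and an NGD-like update on overparameterized least squares, preconditioning by (X^T X / n + eps I)^(-alpha) with alpha between 0 (plain GD) and 1 (NGD-like). This interpolation family, the regularizer eps, and all hyperparameters are assumptions made for illustration and are not necessarily the construction studied in that paper.

```python
# Illustrative sketch only: one possible GD-to-NGD interpolation, not the
# cited paper's exact setup; preconditioner family and parameters are assumed.
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200                                   # overparameterized: d > n
scales = np.linspace(1.0, 0.05, d)               # anisotropic feature scales
X = rng.standard_normal((n, d)) * scales
w_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

def preconditioned_gd(alpha, lr=0.1, steps=2000, eps=1e-6):
    """Run w <- w - lr * P^(-alpha) * grad with P = X^T X / n + eps*I."""
    P = X.T @ X / n + eps * np.eye(d)
    evals, evecs = np.linalg.eigh(P)
    P_inv_alpha = (evecs * evals ** (-alpha)) @ evecs.T   # matrix power P^(-alpha)
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * P_inv_alpha @ grad
    return w

for alpha in (0.0, 0.5, 1.0):                    # 0 = plain GD, 1 = NGD-like update
    w = preconditioned_gd(alpha)
    train_mse = np.mean((X @ w - y) ** 2)
    excess_risk = np.sum(scales ** 2 * (w - w_star) ** 2)  # population excess risk for this data model
    print(f"alpha={alpha:.1f}  train MSE={train_mse:.3e}  excess risk={excess_risk:.3e}")
```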