Last Iterate Risk Bounds of SGD with Decaying Stepsize for
Overparameterized Linear Regression
- URL: http://arxiv.org/abs/2110.06198v1
- Date: Tue, 12 Oct 2021 17:49:54 GMT
- Title: Last Iterate Risk Bounds of SGD with Decaying Stepsize for
Overparameterized Linear Regression
- Authors: Jingfeng Wu and Difan Zou and Vladimir Braverman and Quanquan Gu and
Sham M. Kakade
- Abstract summary: Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
- Score: 122.70478935214128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) has been demonstrated to generalize well in
many deep learning applications. In practice, one often runs SGD with a
geometrically decaying stepsize, i.e., a constant initial stepsize followed by
multiple geometric stepsize decay, and uses the last iterate as the output.
This kind of SGD is known to be nearly minimax optimal for classical
finite-dimensional linear regression problems (Ge et al., 2019), and provably
outperforms SGD with polynomially decaying stepsize in terms of the statistical
minimax rates. However, a sharp analysis for the last iterate of SGD with
decaying stepsize in the overparameterized setting is still open. In this
paper, we provide problem-dependent analysis on the last iterate risk bounds of
SGD with decaying stepsize, for (overparameterized) linear regression problems.
In particular, for SGD with geometrically decaying stepsize (or tail
geometrically decaying stepsize), we prove nearly matching upper and lower
bounds on the excess risk. Our results demonstrate the generalization ability
of SGD for a wide class of overparameterized problems, and can recover the
minimax optimal results up to logarithmic factors in the classical regime.
Moreover, we provide an excess risk lower bound for SGD with polynomially
decaying stepsize and illustrate the advantage of geometrically decaying
stepsize in an instance-wise manner, which complements the minimax rate
comparison made in previous work.
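To make the stepsize schedule described above concrete, the sketch below runs one-pass SGD on a least-squares objective with either a tail-geometrically decaying stepsize (constant over an initial stretch of the pass, then cut by a fixed factor in equally spaced phases) or a polynomially decaying stepsize for contrast, and returns the last iterate rather than an average. This is a minimal illustration under assumed settings: the initial stepsize, decay factor, number of decay phases, and the synthetic overparameterized data are illustrative choices, not the constants or problem instances analyzed in the paper.

```python
import numpy as np

def sgd_last_iterate(X, y, eta0=0.1, decay=0.5, num_phases=8, schedule="tail_geometric"):
    """One-pass SGD for linear regression, returning the LAST iterate.

    schedule="tail_geometric": keep eta0 for the first half of the pass, then
    multiply the stepsize by `decay` at the start of each of `num_phases`
    equally spaced phases over the remaining steps.
    schedule="polynomial": eta_t = eta0 / sqrt(t + 1), shown for contrast.
    All constants here are illustrative, not those analyzed in the paper.
    """
    n, d = X.shape
    w = np.zeros(d)
    eta = eta0
    tail_start = n // 2                        # constant-stepsize phase: first half of the pass
    phase_len = max(1, (n - tail_start) // num_phases)
    for t in range(n):
        if schedule == "tail_geometric":
            # cut the stepsize by `decay` at the start of each tail phase
            if t >= tail_start and (t - tail_start) % phase_len == 0:
                eta *= decay
        elif schedule == "polynomial":
            eta = eta0 / np.sqrt(t + 1.0)      # polynomially decaying stepsize
        x_t, y_t = X[t], y[t]
        grad = (x_t @ w - y_t) * x_t           # stochastic gradient of 0.5 * (x^T w - y)^2
        w = w - eta * grad
    return w                                   # last iterate (no averaging)

# Overparameterized synthetic example: dimension d exceeds sample size n.
rng = np.random.default_rng(0)
n, d = 500, 2000
X = rng.standard_normal((n, d)) / np.sqrt(d)
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

w_geo = sgd_last_iterate(X, y, schedule="tail_geometric")
w_poly = sgd_last_iterate(X, y, schedule="polynomial")
```

The last iterates `w_geo` and `w_poly` can then be compared on held-out data, mirroring on a single instance the geometric-versus-polynomial comparison that the paper carries out in terms of excess risk bounds.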
Related papers
- Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation [22.652143194356864]
We address the problem of solving strongly convex and smooth minimization problems using a stochastic gradient descent (SGD) algorithm with a constant step size.
We provide an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations $n$.
We establish that the Markov chain formed by the SGD iterates is geometrically ergodic with respect to a suitably defined weighted Wasserstein semimetric.
arXiv Detail & Related papers (2024-10-07T15:02:48Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We make comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Empirically, inductive biases are central in preventing overfitting.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize
Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the dimension are large.
Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster, albeit only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)