Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions
- URL: http://arxiv.org/abs/2206.07252v1
- Date: Wed, 15 Jun 2022 02:32:26 GMT
- Title: Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions
- Authors: Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington
- Abstract summary: gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
- Score: 26.782342518986503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) is a pillar of modern machine learning,
serving as the go-to optimization algorithm for a diverse array of problems.
While the empirical success of SGD is often attributed to its computational
efficiency and favorable generalization behavior, neither effect is well
understood and disentangling them remains an open problem. Even in the simple
setting of convex quadratic problems, worst-case analyses give an asymptotic
convergence rate for SGD that is no better than full-batch gradient descent
(GD), and the purported implicit regularization effects of SGD lack a precise
explanation. In this work, we study the dynamics of multi-pass SGD on
high-dimensional convex quadratics and establish an asymptotic equivalence to a
stochastic differential equation, which we call homogenized stochastic gradient
descent (HSGD), whose solutions we characterize explicitly in terms of a
Volterra integral equation. These results yield precise formulas for the
learning and risk trajectories, which reveal a mechanism of implicit
conditioning that explains the efficiency of SGD relative to GD. We also prove
that the noise from SGD negatively impacts generalization performance, ruling
out the possibility of any type of implicit regularization in this context.
Finally, we show how to adapt the HSGD formalism to include streaming SGD,
which allows us to produce an exact prediction for the excess risk of
multi-pass SGD relative to that of streaming SGD (bootstrap risk).
Related papers
- The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization [4.7256945641654164]
gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training.
Recent studies on SGD for canonical quadratic optimization or linear regression show it attains well generalization under suitable high-dimensional settings.
This paper investigates SGD with two essential components in practice: exponentially decaying step size schedule and momentum.
arXiv Detail & Related papers (2024-09-15T14:20:03Z) - On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Gradient Descent (SGD)
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks.
arXiv Detail & Related papers (2023-12-26T18:06:48Z) - Risk Bounds of Accelerated SGD for Overparameterized Linear Regression [75.27846230182885]
Accelerated gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
arXiv Detail & Related papers (2023-11-23T23:02:10Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime [127.21287240963859]
gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more than GD to achieve the same level of excess risk, it saves the number of gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent may gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We make comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of gradient descent (SGD) and of full-batch gradient descent (GD)
We show that with the same number of steps GD may overfit and emit a solution with $Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.