Homogenization of SGD in high-dimensions: Exact dynamics and
generalization properties
- URL: http://arxiv.org/abs/2205.07069v1
- Date: Sat, 14 May 2022 14:10:08 GMT
- Title: Homogenization of SGD in high-dimensions: Exact dynamics and
generalization properties
- Authors: Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington
- Abstract summary: We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD).
We show that homogenized SGD is the high-dimensional equivalence of SGD -- for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD.
- Score: 26.782342518986503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a stochastic differential equation, called homogenized SGD, for
analyzing the dynamics of stochastic gradient descent (SGD) on a
high-dimensional random least squares problem with $\ell^2$-regularization. We
show that homogenized SGD is the high-dimensional equivalence of SGD -- for any
quadratic statistic (e.g., population risk with quadratic loss), the statistic
under the iterates of SGD converges to the statistic under homogenized SGD when
the number of samples $n$ and number of features $d$ are polynomially related
($d^c < n < d^{1/c}$ for some $c > 0$). By analyzing homogenized SGD, we
provide exact non-asymptotic high-dimensional expressions for the
generalization performance of SGD in terms of a solution of a Volterra integral
equation. Further we provide the exact value of the limiting excess risk in the
case of quadratic losses when trained by SGD. The analysis is formulated for
data matrices and target vectors that satisfy a family of resolvent conditions,
which can roughly be viewed as a weak (non-quantitative) form of delocalization
of sample-side singular vectors of the data. Several motivating applications
are provided including sample covariance matrices with independent samples and
random features with non-generative model targets.
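
As a rough illustration of the setting the abstract describes, the sketch below runs SGD on an $\ell^2$-regularized least-squares problem with random Gaussian data and records a quadratic statistic (the population risk) along the trajectory. The data model, problem sizes, step size, and regularization strength are assumptions made for this toy example; the code does not implement homogenized SGD or the Volterra-equation prediction, only the kind of risk curve those tools characterize.

```python
# Toy simulation of the setting in the abstract: SGD on an l2-regularized
# least-squares problem with random data, tracking a quadratic statistic
# (the population risk). All sizes and hyperparameters are illustrative
# assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 1000             # samples and features, comparable in size
delta = 0.1                   # l2-regularization strength (assumed)
gamma = 0.5 / d               # constant step size (assumed)

A = rng.standard_normal((n, d))                # random data matrix
x_star = rng.standard_normal(d) / np.sqrt(d)   # planted signal with norm ~ 1
b = A @ x_star + 0.5 * rng.standard_normal(n)  # noisy targets

def population_risk(x):
    # E[(a^T x - y)^2] for a fresh sample a ~ N(0, I_d), y = a^T x_star + 0.5*noise
    return np.sum((x - x_star) ** 2) + 0.25

x = np.zeros(d)
risks = [population_risk(x)]
for t in range(10 * n):                        # a few passes over the data
    i = rng.integers(n)                        # one uniformly sampled row per step
    # stochastic gradient of 0.5*(A[i]@x - b[i])^2 + 0.5*delta*||x||^2
    grad = (A[i] @ x - b[i]) * A[i] + delta * x
    x -= gamma * grad
    if (t + 1) % n == 0:
        risks.append(population_risk(x))

print([round(float(r), 3) for r in risks])     # risk after each pass
```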
Related papers
- Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent [6.3151583550712065]
We study the dynamics of a continuous-time model of stochastic gradient descent (SGD).
We analyze degenerate stochastic differential equations (SDEs) that model SGD either in the case of the training loss (finite samples) or the population one (online setting).
arXiv Detail & Related papers (2024-07-02T14:52:21Z)
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z)
- Demystifying the Myths and Legends of Nonconvex Convergence of SGD [17.445810977264067]
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems.
As a by-product of our analyses, we address certain myths and legends related to the nonconvergence of the gradient.
arXiv Detail & Related papers (2023-10-19T17:58:59Z)
- Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models [10.781866671930857]
We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit.
We demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations.
In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient.
arXiv Detail & Related papers (2023-08-17T13:33:02Z)
- Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z)
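
The entry above predicts the excess risk of multi-pass SGD relative to streaming SGD via the HSGD formalism. The sketch below is only a toy comparison under assumed Gaussian data and an arbitrary step size: it runs the two sampling schemes for the same number of stochastic gradient steps and prints the excess risks that such a prediction would describe.

```python
# Toy comparison of multi-pass SGD (resampling a fixed dataset) with streaming SGD
# (a fresh sample per step) at an equal budget of stochastic gradient steps.
# Gaussian data, sizes, and step size are assumptions; this is not the HSGD
# prediction itself, only the pair of algorithms it compares.
import numpy as np

rng = np.random.default_rng(1)
n, d, steps = 1000, 500, 8000
gamma, sigma = 0.5 / d, 0.5
x_star = rng.standard_normal(d) / np.sqrt(d)

def excess_risk(x):
    # population excess risk for isotropic Gaussian features
    return np.sum((x - x_star) ** 2)

# Multi-pass SGD over one fixed dataset of n samples
A = rng.standard_normal((n, d))
b = A @ x_star + sigma * rng.standard_normal(n)
x_mp = np.zeros(d)
for _ in range(steps):
    i = rng.integers(n)
    x_mp -= gamma * (A[i] @ x_mp - b[i]) * A[i]

# Streaming (one-pass) SGD: every step sees a freshly drawn sample
x_st = np.zeros(d)
for _ in range(steps):
    a = rng.standard_normal(d)
    y = a @ x_star + sigma * rng.standard_normal()
    x_st -= gamma * (a @ x_st - y) * a

print("multi-pass:", excess_risk(x_mp), "streaming:", excess_risk(x_st))
```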
- Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves on the number of stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z)
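
The entry above contrasts iteration counts with gradient-evaluation counts for multi-pass SGD versus GD. The toy sketch below, under assumed problem sizes and step sizes, gives both methods the same budget of single-sample gradient evaluations (one full-batch GD step costs n of them) and prints the resulting excess risks; it only illustrates the accounting, not the paper's risk bounds.

```python
# Toy illustration of the iteration-vs-evaluation trade-off: SGD and full-batch
# GD share a budget of single-sample gradient evaluations. Sizes and step sizes
# are assumed; this is not the paper's analysis.
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 100
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d) / np.sqrt(d)
b = A @ x_star + 0.1 * rng.standard_normal(n)

def excess_risk(x):
    return np.sum((x - x_star) ** 2)   # population excess risk, isotropic features

budget = 20 * n                        # total single-sample gradient evaluations

# Full-batch GD: budget // n iterations, each costing n evaluations
x_gd = np.zeros(d)
for _ in range(budget // n):
    x_gd -= 0.5 * A.T @ (A @ x_gd - b) / n

# SGD: one evaluation per iteration, so `budget` iterations in total
x_sgd = np.zeros(d)
for _ in range(budget):
    i = rng.integers(n)
    x_sgd -= (0.5 / d) * (A[i] @ x_sgd - b[i]) * A[i]

print("GD:", excess_risk(x_gd), "SGD:", excess_risk(x_sgd))
```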
- Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
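
The entry above concerns random features regression trained with SGD. The sketch below sets up a toy version under assumed dimensions, a ReLU feature map, and an ad hoc step size, trains the second-layer weights with multi-pass SGD, and prints the test error as the number of random features varies; sweeping that count around the sample size is where one would look for the double-descent shape.

```python
# Toy random-features (RF) regression trained with SGD. Dimensions, noise level,
# the ReLU feature map, and the step-size rule are illustrative assumptions,
# not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(3)
d, n_train, n_test = 20, 100, 1000
beta = rng.standard_normal(d) / np.sqrt(d)

def make_data(n):
    X = rng.standard_normal((n, d))
    y = X @ beta + 0.1 * rng.standard_normal(n)
    return X, y

def rf_features(X, W):
    # ReLU random features, normalized by the number of features
    return np.maximum(X @ W, 0.0) / np.sqrt(W.shape[1])

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for p in [20, 50, 100, 200, 400]:                      # number of random features
    W = rng.standard_normal((d, p))
    F_tr, F_te = rf_features(X_tr, W), rf_features(X_te, W)
    theta = np.zeros(p)
    gamma = 0.5 / np.mean(np.sum(F_tr ** 2, axis=1))   # crude stable step size
    for _ in range(50 * n_train):                      # multi-pass SGD on squared loss
        i = rng.integers(n_train)
        theta -= gamma * (F_tr[i] @ theta - y_tr[i]) * F_tr[i]
    print(p, np.mean((F_te @ theta - y_te) ** 2))      # test error vs feature count
```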
- The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
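
The entry above compares unregularized constant-stepsize SGD with ordinary least squares in linear regression. The sketch below sets up a toy overparameterized instance under assumed sizes, noise level, step size, and averaging scheme, and prints the population risk of the minimum-norm OLS interpolator next to that of averaged constant-stepsize SGD; it is an illustration of the two estimators being compared, not the paper's bounds.

```python
# Toy overparameterized linear regression (d > n): averaged constant-stepsize SGD
# versus the minimum-norm ordinary least squares interpolator. All problem choices
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 400                                   # overparameterized: d > n
X = rng.standard_normal((n, d)) / np.sqrt(d)
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

def population_risk(w):
    # E[(x^T w - y)^2] for x ~ N(0, I_d / d) and y = x^T w_star + 0.1*noise
    return np.sum((w - w_star) ** 2) / d + 0.01

# Minimum-norm OLS interpolator
w_ols = np.linalg.pinv(X) @ y

# Constant-stepsize SGD with averaging over the second half of the iterates
w = np.zeros(d)
w_avg, count = np.zeros(d), 0
T = 50 * n
for t in range(T):
    i = rng.integers(n)
    w -= 0.5 * (X[i] @ w - y[i]) * X[i]           # unregularized least-squares step
    if t >= T // 2:
        w_avg += w
        count += 1
w_avg /= count

print("OLS:", population_risk(w_ols), "averaged SGD:", population_risk(w_avg))
```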
- SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and dimensions are large.
Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z)
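
The entry above reports that, on least-squares problems with random data, SGD dynamics become deterministic as the number of samples and the dimension grow. The toy experiment below (assumed Gaussian data and an arbitrary step size, not the paper's framework) reruns SGD with independent data and independent sampling noise at two problem sizes and prints the run-to-run spread of the per-pass risk curves, which should shrink at the larger size.

```python
# Toy check of the concentration claim in the entry above: independent SGD runs on
# independent random least-squares instances produce nearly identical risk curves
# once n and d are large. All model choices here are illustrative assumptions.
import numpy as np

def sgd_risk_curve(n, d, seed, passes=5):
    rng = np.random.default_rng(seed)
    gamma = 0.5 / d                                  # assumed constant step size
    A = rng.standard_normal((n, d))
    x_star = rng.standard_normal(d) / np.sqrt(d)
    b = A @ x_star + 0.5 * rng.standard_normal(n)
    x = np.zeros(d)
    curve = []
    for t in range(passes * n):
        i = rng.integers(n)
        x -= gamma * (A[i] @ x - b[i]) * A[i]
        if (t + 1) % n == 0:
            curve.append(np.sum((x - x_star) ** 2) + 0.25)   # population risk
    return np.array(curve)

for n, d in [(200, 100), (2000, 1000)]:
    curves = np.stack([sgd_risk_curve(n, d, seed) for seed in range(5)])
    # maximum over passes of the run-to-run standard deviation of the risk
    print(n, d, float(np.max(np.std(curves, axis=0))))
```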