High-dimensional limit theorems for SGD: Effective dynamics and critical scaling
- URL: http://arxiv.org/abs/2206.04030v4
- Date: Thu, 17 Aug 2023 20:05:36 GMT
- Title: High-dimensional limit theorems for SGD: Effective dynamics and critical scaling
- Authors: Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath
- Abstract summary: We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD) as the dimension goes to infinity.
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
- Score: 6.950316788263433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the scaling limits of stochastic gradient descent (SGD) with
constant step-size in the high-dimensional regime. We prove limit theorems for
the trajectories of summary statistics (i.e., finite-dimensional functions) of
SGD as the dimension goes to infinity. Our approach allows one to choose the
summary statistics that are tracked, the initialization, and the step-size. It
yields both ballistic (ODE) and diffusive (SDE) limits, with the limit
depending dramatically on the former choices. We show a critical scaling regime
for the step-size, below which the effective ballistic dynamics matches
gradient flow for the population loss, but at which, a new correction term
appears which changes the phase diagram. About the fixed points of this
effective dynamics, the corresponding diffusive limits can be quite complex and
even degenerate. We demonstrate our approach on popular examples including
estimation for spiked matrix and tensor models and classification via two-layer
networks for binary and XOR-type Gaussian mixture models. These examples
exhibit surprising phenomena including multimodal timescales to convergence as
well as convergence to sub-optimal solutions with probability bounded away from
zero from random (e.g., Gaussian) initializations. At the same time, we
demonstrate the benefit of overparametrization by showing that the latter
probability goes to zero as the second layer width grows.
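To make the ballistic (ODE) limit concrete, here is a minimal numerical sketch in Python. It does not reproduce the paper's spiked matrix/tensor or Gaussian-mixture examples; it uses a noiseless linear single-index toy model, with online SGD at step-size delta/d and the overlap $m_k = \langle x_k, v \rangle$ as the tracked summary statistic. In this toy setting the effective dynamics is the population gradient flow $\dot m = 1 - m$; the dimension, step-size parameter, and horizon below are illustrative choices.
```python
# A minimal sketch of a ballistic (ODE) limit, assuming a noiseless linear
# single-index model y = <v, a> (NOT the paper's spiked tensor / XOR examples).
# Online SGD with step-size delta/d; the summary statistic m_k = <x_k, v>
# concentrates, as d grows, on the gradient-flow ODE dm/dt = 1 - m.
import numpy as np

rng = np.random.default_rng(0)

d = 4000            # ambient dimension (illustrative choice)
delta = 0.5         # step-size parameter; the SGD step-size is delta / d
T = 4.0             # horizon in rescaled time t = k * delta / d
n_steps = int(T * d / delta)

v = np.zeros(d)
v[0] = 1.0          # ground-truth unit vector
x = np.zeros(d)     # initialization with zero overlap, m_0 = 0

times, overlaps = [], []
for k in range(n_steps):
    a = rng.standard_normal(d)           # fresh sample each step (online SGD)
    y = v @ a                            # noiseless label y = <v, a>
    x += (delta / d) * (y - x @ a) * a   # one SGD step on the square loss
    if k % (n_steps // 100) == 0:
        times.append(k * delta / d)
        overlaps.append(x @ v)           # tracked summary statistic m_k

ode = 1.0 - np.exp(-np.array(times))     # effective ballistic dynamics m(t)
print("max |m_k - m(t)| =", np.abs(np.array(overlaps) - ode).max())
```
Rerunning with larger d shows the empirical overlap concentrating around the curve $m(t) = 1 - e^{-t}$, which is the kind of effective dynamics the limit theorems above formalize.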
Related papers
- Large data limits and scaling laws for tSNE [1.2085509610251701]
We show that embeddings of the original tSNE algorithm cannot have any consistent limit as $n \to \infty$.
We propose a rescaled model which mitigates the decay of the attractive energy, and which does have a consistent limit.
arXiv Detail & Related papers (2024-10-16T21:43:02Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning [68.76846801719095]
We pinpoint when and where double descent occurs, and show that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z)
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- High-dimensional scaling limits and fluctuations of online least-squares SGD with smooth covariance [16.652085114513273]
We derive high-dimensional scaling limits and fluctuations for the online least-squares stochastic gradient descent (SGD) algorithm.
Our results have several applications, including characterization of the limiting mean-square estimation or prediction errors and their fluctuations.
arXiv Detail & Related papers (2023-04-03T03:50:00Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the dimension are large.
Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z)
- Last iterate convergence of SGD for Least-Squares in the Interpolation regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimal predictor perfectly fits the inputs and outputs, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map (see the sketch after this list).
arXiv Detail & Related papers (2021-02-05T14:02:20Z)
- A Dynamical Central Limit Theorem for Shallow Neural Networks [48.66103132697071]
We prove that the fluctuations around the mean limit remain bounded in mean square throughout training.
If the mean-field dynamics converges to a measure that interpolates the training data, we prove that the deviation eventually vanishes in the CLT scaling.
arXiv Detail & Related papers (2020-08-21T18:00:50Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster in time, but only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification [25.898873960635534]
We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture.
We define a prototype process that can be extended to a continuous-time gradient flow.
In the full-batch limit, we recover the standard gradient flow.
arXiv Detail & Related papers (2020-06-10T22:49:41Z)
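As a companion to the least-squares interpolation entry above, the following sketch simulates the noiseless setting $\langle \theta_*, \phi(X) \rangle = Y$ with finite-dimensional Gaussian features (an illustrative assumption standing in for the possibly infinite-dimensional feature map) and checks that the last iterate of constant step-size SGD converges without any averaging.
```python
# A minimal sketch of the noiseless (interpolation) least-squares setting,
# assuming finite-dimensional Gaussian features phi(X) = X in place of a
# possibly infinite-dimensional feature map. Since Y = <theta_*, X> exactly,
# constant step-size SGD has no noise floor and its last iterate converges.
import numpy as np

rng = np.random.default_rng(1)

p = 50                               # feature dimension (illustrative)
theta_star = rng.standard_normal(p)  # optimal predictor theta_*
theta = np.zeros(p)
gamma = 0.01                         # constant step-size, assumed small enough

for _ in range(5000):
    x = rng.standard_normal(p)       # feature vector phi(X)
    y = theta_star @ x               # noiseless label: <theta_*, phi(X)> = Y
    theta += gamma * (y - theta @ x) * x   # one SGD step on the square loss

print("last-iterate error:", np.linalg.norm(theta - theta_star))
```
The printed error is near machine precision, reflecting that in the interpolation regime the per-step gradient noise vanishes at the optimum.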