Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on
GLMs and multi-index models
- URL: http://arxiv.org/abs/2308.08977v1
- Date: Thu, 17 Aug 2023 13:33:02 GMT
- Title: Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on
GLMs and multi-index models
- Authors: Elizabeth Collins-Woodfin, Courtney Paquette, Elliot Paquette, Inbar
Seroussi
- Abstract summary: We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit.
We demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations.
In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient.
- Score: 10.781866671930857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the dynamics of streaming stochastic gradient descent (SGD) in the
high-dimensional limit when applied to generalized linear models and
multi-index models (e.g. logistic regression, phase retrieval) with general
data-covariance. In particular, we demonstrate a deterministic equivalent of
SGD in the form of a system of ordinary differential equations that describes a
wide class of statistics, such as the risk and other measures of
sub-optimality. This equivalence holds with overwhelming probability when the
model parameter count grows proportionally to the number of data. This
framework allows us to obtain learning rate thresholds for stability of SGD as
well as convergence guarantees. In addition to the deterministic equivalent, we
introduce an SDE with a simplified diffusion coefficient (homogenized SGD)
which allows us to analyze the dynamics of general statistics of SGD iterates.
Finally, we illustrate this theory on some standard examples and show numerical
simulations which give an excellent match to the theory.
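The setting described above (one-pass streaming SGD applied to a GLM such as logistic regression, with a general data covariance, while tracking the risk along the trajectory) can be made concrete with a short simulation. The sketch below is only an illustration of that setup under assumed choices of dimension, covariance spectrum, step size, and planted signal; it does not implement the paper's ODE or homogenized SGD constructions.

```python
import numpy as np

# Minimal sketch of the streaming-SGD setting analyzed above: one-pass SGD on
# logistic regression (a GLM) with a non-identity data covariance, tracking a
# Monte Carlo estimate of the population risk. All constants are illustrative.
rng = np.random.default_rng(0)
d, steps, gamma = 500, 20_000, 1.0 / 500      # dimension, iterations, step size ~ O(1/d)

eigs = np.linspace(1.0, 3.0, d)               # spectrum of a diagonal covariance Sigma
sqrt_cov = np.sqrt(eigs)

x_star = rng.standard_normal(d) / np.sqrt(d)  # planted parameter ("ground truth")
x = np.zeros(d)                               # SGD iterate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def population_risk(x, n_mc=2000):
    """Monte Carlo estimate of the expected logistic loss at x."""
    A = sqrt_cov * rng.standard_normal((n_mc, d))   # rows ~ N(0, Sigma)
    z, z_star = A @ x, A @ x_star
    return np.mean(np.log1p(np.exp(z)) - sigmoid(z_star) * z)

risks = []
for t in range(steps):
    a = sqrt_cov * rng.standard_normal(d)           # fresh (streaming) sample
    y = rng.binomial(1, sigmoid(a @ x_star))        # label drawn from the planted GLM
    x -= gamma * (sigmoid(a @ x) - y) * a           # one SGD step on the logistic loss
    if t % 1000 == 0:
        risks.append(population_risk(x))

print([round(r, 4) for r in risks])
```

Plotting the recorded risks against iteration (rescaled by d) is the kind of statistic the deterministic ODE equivalent is meant to predict.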
Related papers
- Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent [6.3151583550712065]
We study the dynamics of a continuous-time model of Stochastic Gradient Descent (SGD).
We analyze degenerate Stochastic Differential Equations (SDEs) that model SGD either in the case of the training loss (finite samples) or the population one (online setting).
arXiv Detail & Related papers (2024-07-02T14:52:21Z) - On the Trajectory Regularity of ODE-based Diffusion Sampling [79.17334230868693]
Diffusion-based generative models use differential equations to establish a smooth connection between a complex data distribution and a tractable prior distribution.
In this paper, we identify several intriguing trajectory properties in the ODE-based sampling process of diffusion models.
arXiv Detail & Related papers (2024-05-18T15:59:41Z) - A Geometric Perspective on Diffusion Models [57.27857591493788]
We inspect the ODE-based sampling of a popular variance-exploding SDE.
We establish a theoretical relationship between the optimal ODE-based sampling and the classic mean-shift (mode-seeking) algorithm.
arXiv Detail & Related papers (2023-05-31T15:33:16Z) - Acceleration of stochastic gradient descent with momentum by averaging:
finite-sample rates and asymptotic normality [7.793751573538388]
Stochastic gradient descent with momentum (SGDM) has been widely used in many machine learning and statistical applications.
We analyze the finite-sample convergence rate of SGDM in the strongly convex setting.
We show that, with a large batch size, mini-batch SGDM converges faster than mini-batch SGD to a neighborhood of the optimal value (a generic heavy-ball sketch appears after this list).
arXiv Detail & Related papers (2023-05-28T08:49:24Z) - Rigorous dynamical mean field theory for stochastic gradient descent
methods [17.90683687731009]
We prove closed-form equations for the exact high-dimensional asymptotics of a family of first-order gradient-based methods.
This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration.
arXiv Detail & Related papers (2022-10-12T21:10:55Z) - NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum parameters such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize
Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the dimension are large.
Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
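As referenced in the SGDM entry above, the heavy-ball mini-batch update is easy to state in code. The following is a generic sketch of mini-batch SGD with heavy-ball momentum on a strongly convex least-squares problem, not the cited paper's analysis; the batch size, step size, and momentum coefficient are assumed values for the demo.

```python
import numpy as np

# Generic mini-batch SGD with heavy-ball momentum (SGDM) on a strongly convex
# least-squares problem -- a sketch of the update referenced in the SGDM entry,
# not that paper's analysis. Batch size, step size, and momentum are assumed.
rng = np.random.default_rng(1)
d, n, batch, steps = 50, 5_000, 256, 2_000
gamma, beta = 0.05, 0.9                               # step size, momentum coefficient

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true + 0.1 * rng.standard_normal(n)

def run(momentum):
    x, v = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)          # draw a mini-batch
        grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch
        v = momentum * v + grad                       # heavy-ball velocity
        x -= gamma * v                                # parameter update
    return 0.5 * np.mean((A @ x - y) ** 2)            # final training loss

print("mini-batch SGD :", run(0.0))
print("mini-batch SGDM:", run(beta))
```

Setting the momentum coefficient to zero recovers plain mini-batch SGD, which makes the two runs directly comparable.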
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.