The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations,
and Anomalous Diffusion
- URL: http://arxiv.org/abs/2107.09133v4
- Date: Thu, 28 Dec 2023 17:48:28 GMT
- Title: The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations,
and Anomalous Diffusion
- Authors: Daniel Kunin, Javier Sagastuy-Brena, Lauren Gillespie, Eshed Margalit,
Hidenori Tanaka, Surya Ganguli, Daniel L. K. Yamins
- Abstract summary: We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
- Score: 29.489737359897312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we explore the limiting dynamics of deep neural networks trained
with stochastic gradient descent (SGD). As observed previously, long after
performance has converged, networks continue to move through parameter space by
a process of anomalous diffusion in which distance travelled grows as a power
law in the number of gradient updates with a nontrivial exponent. We reveal an
intricate interaction between the hyperparameters of optimization, the
structure in the gradient noise, and the Hessian matrix at the end of training
that explains this anomalous diffusion. To build this understanding, we first
derive a continuous-time model for SGD with finite learning rates and batch
sizes as an underdamped Langevin equation. We study this equation in the
setting of linear regression, where we can derive exact, analytic expressions
for the phase space dynamics of the parameters and their instantaneous
velocities from initialization to stationarity. Using the Fokker-Planck
equation, we show that the key ingredient driving these dynamics is not the
original training loss, but rather the combination of a modified loss, which
implicitly regularizes the velocity, and probability currents, which cause
oscillations in phase space. We identify qualitative and quantitative
predictions of this theory in the dynamics of a ResNet-18 model trained on
ImageNet. Through the lens of statistical physics, we uncover a mechanistic
origin for the anomalous limiting dynamics of deep neural networks trained with
SGD.
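As a rough illustration of the measurement described in the abstract, the sketch below (not code from the paper; the setup, hyperparameters, and random data are assumptions made for the example) runs minibatch SGD with momentum on a linear regression problem, lets the loss converge, and then fits the exponent of the power law relating distance travelled in parameter space to the number of gradient updates.

```python
# Minimal sketch: minibatch SGD with momentum on linear regression, followed by a
# power-law fit of the distance travelled in parameter space after convergence.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2048, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

lr, momentum, batch = 0.05, 0.9, 32
w, v = np.zeros(d), np.zeros(d)

def sgd_step(w, v):
    idx = rng.choice(n, size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    v = momentum * v - lr * grad
    return w + v, v

# Burn-in: run until the training loss has effectively converged.
for _ in range(20_000):
    w, v = sgd_step(w, v)

# Limiting dynamics: track the distance from the burn-in endpoint.
w0 = w.copy()
steps, dists = [], []
for t in range(1, 100_001):
    w, v = sgd_step(w, v)
    if t % 100 == 0:
        steps.append(t)
        dists.append(np.linalg.norm(w - w0))

# Anomalous diffusion: distance ~ t^c, so fit the slope on a log-log scale.
c = np.polyfit(np.log(steps), np.log(dists), 1)[0]
print(f"estimated diffusion exponent c = {c:.2f}")
```

For this strongly convex toy problem the displacement eventually saturates at its stationary spread, so the fitted exponent mainly reflects the transient; the nontrivial exponents discussed in the paper arise for deep networks from the interaction of the optimization hyperparameters, the gradient-noise structure, and the Hessian at the end of training.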
Related papers
- Law of Balance and Stationary Distribution of Stochastic Gradient
Descent [11.937085301750288]
We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
We then derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
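To make the rescaling symmetry concrete (a standard textbook-style example, not a statement quoted from the paper): a loss that depends on two parameter blocks only through their elementwise product is invariant when one block is scaled up and the other scaled down, and a balanced solution equalizes the two blocks' magnitudes.

```latex
% Example of a rescaling symmetry: a loss depending on (u, w) only through u \odot w
% is invariant under (u, w) -> (\lambda u, w / \lambda); balanced solutions satisfy |u_i| = |w_i|.
\[
  L(u, w) = \ell(u \odot w)
  \;\Longrightarrow\;
  L\!\left(\lambda u, \tfrac{1}{\lambda} w\right) = L(u, w)
  \quad \text{for all } \lambda > 0 .
\]
```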
arXiv Detail & Related papers (2023-08-13T03:13:03Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
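For reference, a common way to write mean-field Langevin dynamics with its distribution-dependent drift is shown below (a generic form consistent with the summary above, not an equation quoted from the paper; F_0 denotes the energy part of the entropy-regularized objective F(mu) = F_0(mu) + lambda * Ent(mu)).

```latex
% Mean-field Langevin dynamics: the drift depends on the law \mu_t of X_t through
% the first variation of the energy part F_0; \lambda sets the noise scale.
\[
  dX_t = -\nabla_x \frac{\delta F_0}{\delta \mu}(\mu_t)(X_t)\, dt + \sqrt{2\lambda}\, dW_t,
  \qquad \mu_t = \operatorname{Law}(X_t).
\]
```

The practical algorithm replaces mu_t by the empirical measure of finitely many particles and discretizes time with stochastic gradients, which is exactly where the uniform-in-time propagation-of-chaos framework mentioned above is needed.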
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between network training and physical systems out of equilibrium.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
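A minimal sketch of what without-replacement minibatch SGLD can look like is given below (an illustrative assumption, not the paper's exact algorithm: a random permutation of the data each epoch supplies the minibatches, and each update adds Gaussian noise scaled by the learning rate).

```python
# Illustrative SGLD with without-replacement minibatching: every epoch visits each
# example exactly once via a random permutation; each update injects Gaussian noise.
import numpy as np

def sgld_without_replacement(grad_fn, theta, data, lr=1e-3, temperature=1.0,
                             batch_size=32, epochs=10, seed=0):
    """data: np.ndarray of examples; grad_fn(theta, batch) returns the minibatch gradient."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        perm = rng.permutation(n)                        # minibatches drawn without replacement
        for start in range(0, n, batch_size):
            batch = data[perm[start:start + batch_size]]
            g = grad_fn(theta, batch)                    # minibatch gradient estimate
            noise = rng.normal(size=theta.shape)
            theta = theta - lr * g + np.sqrt(2.0 * lr * temperature) * noise
    return theta
```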
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which then can be applied in real-time to multi-dimensional scattering data.
arXiv Detail & Related papers (2023-04-08T07:55:36Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective at solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
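As a sketch of the general idea behind an implicit gradient step (a generic update written here for illustration, not necessarily the exact scheme used in the paper), the next iterate is defined through the gradient evaluated at the new point, theta_{k+1} = theta_k - lr * grad(theta_{k+1}), which can be solved with a simple inner fixed-point iteration when the step size is small enough.

```python
# Generic implicit gradient step: theta_next solves
#   theta_next = theta - lr * grad(theta_next).
# The inner fixed-point iteration converges when lr * Lipschitz(grad) < 1.
import numpy as np

def implicit_gd_step(grad_fn, theta, lr, inner_iters=20):
    theta_next = theta - lr * grad_fn(theta)             # explicit step as the initial guess
    for _ in range(inner_iters):
        theta_next = theta - lr * grad_fn(theta_next)    # refine the implicit equation
    return theta_next

# Toy example: L(theta) = 2 * theta**2, so grad(theta) = 4 * theta and the exact
# implicit step is theta / (1 + 4 * lr).
grad = lambda th: 4.0 * th
theta = np.array([1.0])
for _ in range(50):
    theta = implicit_gd_step(grad, theta, lr=0.1)
print(theta)  # decays geometrically toward 0
```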
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks [72.09574528342732]
Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs).
They often fail to converge to desirable solutions when the target function contains high-frequency features, due to a phenomenon known as spectral bias.
In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM).
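For reference, the empirical neural tangent kernel used in this kind of analysis is the standard object below (a textbook definition, not a formula specific to this paper); roughly, its eigenvalues control how fast different frequency components of the target are learned, which is the mechanism behind spectral bias.

```latex
% Empirical neural tangent kernel of a network f(x; \theta): the linearized
% training dynamics in function space are governed by this kernel.
\[
  \Theta(x, x') \;=\; \bigl\langle \nabla_{\theta} f(x; \theta),\, \nabla_{\theta} f(x'; \theta) \bigr\rangle .
\]
```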
arXiv Detail & Related papers (2022-06-29T19:03:10Z)
- A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We identify restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD).
We do this by presenting a theoretical framework based on the numerical analysis of partial differential equations (PDEs), and analyzing the gradient descent PDE of convolutional neural networks (CNNs).
We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect.
arXiv Detail & Related papers (2022-06-04T14:54:05Z)
- Liquid Time-constant Networks [117.57116214802504]
We introduce a new class of time-continuous recurrent neural network models.
Instead of declaring a learning system's dynamics by implicit nonlinearities, we construct networks of linear first-order dynamical systems.
These neural networks exhibit stable and bounded behavior and yield superior expressivity within the family of neural ordinary differential equations.
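For context, the liquid time-constant state equation is commonly written in the form below (notation adapted here for illustration and to be treated as an assumption rather than a quote from the paper): a learned nonlinearity f modulates both the effective time constant of each unit and its drive.

```latex
% Liquid time-constant (LTC) state dynamics: the nonlinearity f(x, I, t; \theta)
% enters both the decay term (an input-dependent time constant) and the driving term.
\[
  \frac{d\mathbf{x}(t)}{dt}
  = -\Bigl[\tfrac{1}{\tau} + f\bigl(\mathbf{x}(t), \mathbf{I}(t), t; \theta\bigr)\Bigr] \odot \mathbf{x}(t)
  \;+\; f\bigl(\mathbf{x}(t), \mathbf{I}(t), t; \theta\bigr) \odot A .
\]
```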
arXiv Detail & Related papers (2020-06-08T09:53:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.