Law of Balance and Stationary Distribution of Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2308.06671v1
- Date: Sun, 13 Aug 2023 03:13:03 GMT
- Title: Law of Balance and Stationary Distribution of Stochastic Gradient Descent
- Authors: Liu Ziyin, Hongchao Li, Masahito Ueda
- Abstract summary: We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
We then derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
- Score: 11.937085301750288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The stochastic gradient descent (SGD) algorithm is the standard method for
training neural networks. However, it remains poorly understood how SGD
navigates the highly nonlinear and degenerate loss landscape of a neural
network. In this work, we prove that the minibatch noise of SGD regularizes the
solution towards a balanced solution whenever the loss function contains a
rescaling symmetry. Because the difference between a simple diffusion process
and SGD dynamics is the most significant when symmetries are present, our
theory implies that the loss function symmetries constitute an essential probe
of how SGD works. We then apply this result to derive the stationary
distribution of stochastic gradient flow for a diagonal linear network with
arbitrary depth and width. The stationary distribution exhibits complicated
nonlinear phenomena such as phase transitions, broken ergodicity, and
fluctuation inversion. These phenomena are shown to exist uniquely in deep
networks, implying a fundamental difference between deep and shallow models.
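To make the rescaling symmetry in the abstract concrete, consider a depth-2 diagonal linear model f(x) = sum_i u_i w_i x_i: the loss depends on the parameters only through the products u_i w_i, so the rescaling (u_i, w_i) -> (c u_i, w_i / c) leaves it unchanged. The script below is a minimal numerical sketch of the qualitative claim only, not the paper's setup or derivation; the data, initialization, and hyperparameters are illustrative assumptions. It tracks the per-coordinate imbalance u_i^2 - w_i^2 after full-batch versus small-batch training.

```python
# Toy illustration (not from the paper): depth-2 diagonal linear model
#   f(x) = sum_i u_i * w_i * x_i.
# The loss depends on (u, w) only through the products u_i * w_i, so the
# rescaling (u_i, w_i) -> (c * u_i, w_i / c) leaves it invariant.
# Data, initialization, and hyperparameters are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)  # linear targets plus noise

def mean_imbalance_after_training(batch_size, lr=0.05, steps=20_000):
    u = np.full(d, 2.0)   # deliberately unbalanced start: |u| >> |w|
    w = np.full(d, 0.1)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        r = X[idx] @ (u * w) - y[idx]          # minibatch residuals
        g = X[idx].T @ r / batch_size          # gradient w.r.t. the product u*w
        u, w = u - lr * g * w, w - lr * g * u  # chain rule for each factor
    return np.mean(np.abs(u**2 - w**2))        # per-coordinate imbalance

print("full batch:", mean_imbalance_after_training(batch_size=n))  # stays large
print("minibatch :", mean_imbalance_after_training(batch_size=8))  # shrinks toward balance
```

Under these updates each step multiplies u_i^2 - w_i^2 by (1 - lr^2 g_i^2), so the imbalance is approximately conserved once the full-batch gradient vanishes, whereas persistent minibatch gradient noise keeps shrinking it toward the balanced solution described in the abstract.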
Related papers
- Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation [6.122833099916154]
We investigate the stationary (late-time) training regime of single- and two-layer underparameterized linear neural networks.
We identify the inter-layer coupling as a distinct source of anisotropy for the weight fluctuations.
We provide an analytical derivation of the recently observed inverse variance-flatness relation in a model of a deep linear neural network.
arXiv Detail & Related papers (2023-11-23T17:30:31Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between machine learning and statistical physics in and out of equilibrium.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching; a generic illustration of this scheme is sketched after this list.
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
PINNs are prone to training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of stochastic gradient descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks [72.09574528342732]
Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs).
They often fail to converge to desirable solutions when the target function contains high-frequency features, due to a phenomenon known as spectral bias.
In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM).
arXiv Detail & Related papers (2022-06-29T19:03:10Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z)
- The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z)
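For the SGLD entry above: "without-replacement minibatching" means the dataset is reshuffled each epoch and traversed in disjoint minibatches, so no sample repeats within an epoch. The sketch below is a generic illustration of that sampling scheme combined with Langevin-style noise injection, not the specific algorithm or analysis of that paper; the least-squares objective, temperature, and step size are placeholder assumptions.

```python
# Generic sketch of stochastic gradient Langevin dynamics (SGLD) with
# without-replacement minibatching: the data are reshuffled every epoch and
# each sample is visited exactly once per epoch in disjoint minibatches.
# The least-squares objective, temperature, and step size are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 4
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + 0.3 * rng.normal(size=n)

def sgld_without_replacement(epochs=50, batch_size=32, lr=5e-3, temperature=1e-2):
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)                 # fresh shuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # disjoint minibatches
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)  # grad of 0.5*MSE
            noise = np.sqrt(2.0 * lr * temperature) * rng.normal(size=d)
            theta = theta - lr * grad + noise     # Langevin step
    return theta

print(sgld_without_replacement())  # fluctuates around the least-squares solution (all ones)
```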