Revisiting the Noise Model of Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2303.02749v1
- Date: Sun, 5 Mar 2023 18:55:12 GMT
- Title: Revisiting the Noise Model of Stochastic Gradient Descent
- Authors: Barak Battash and Ofir Lindenbaum
- Abstract summary: Stochastic gradient noise (SGN) is a significant factor in the success of stochastic gradient descent (SGD).
We show that SGN is heavy-tailed and better depicted by the $S\alpha S$ distribution.
- Score: 5.482532589225552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The stochastic gradient noise (SGN) is a significant factor in the success of
stochastic gradient descent (SGD). Following the central limit theorem, SGN was
initially modeled as Gaussian, and lately, it has been suggested that
stochastic gradient noise is better characterized using $S\alpha S$ L\'evy
distribution. This claim was subsequently disputed, with later work reverting to the
previously suggested Gaussian noise model. This paper presents solid, detailed empirical
evidence that SGN is heavy-tailed and better depicted by the $S\alpha S$
distribution. Furthermore, we argue that different parameters in a deep neural
network (DNN) hold distinct SGN characteristics throughout training. To more
accurately approximate the dynamics of SGD near a local minimum, we construct a
novel framework in $\mathbb{R}^N$, based on L\'evy-driven stochastic
differential equation (SDE), where one-dimensional L\'evy processes model each
parameter in the DNN. Next, we show that SGN jump intensity (frequency and
amplitude) depends on the learning rate decay mechanism (LRdecay); furthermore,
we demonstrate empirically that the LRdecay effect may stem from the reduction
of the SGN and not the decrease in the step size. Based on our analysis, we
examine the mean escape time, trapping probability, and more properties of DNNs
near local minima. Finally, we prove that the training process will likely exit
from the basin in the direction of parameters with heavier-tailed SGN. We will
share our code for reproducibility.
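The paper's central empirical claim is that per-parameter SGN is heavy-tailed and closer to an $S\alpha S$ law than to a Gaussian, which in practice comes down to estimating a tail index $\alpha \in (0, 2]$ from gradient-noise samples. Below is a minimal sketch of one standard block estimator for the stability index (Mohammadi et al., 2015), which earlier SGN tail-index studies have used; whether this paper relies on this exact estimator is an assumption, and the synthetic samples stand in for per-parameter gradient noise.

```python
# Tail-index (alpha) estimation -- a minimal sketch, not the authors' released code.
import numpy as np
from scipy.stats import levy_stable


def estimate_alpha(samples, n_blocks=1000):
    """Estimate the stability index alpha of (approximately) i.i.d. SaS samples.

    Uses the identity E[log|sum of K1 samples|] - E[log|sample|] = (log K1)/alpha,
    which follows from sums of K1 i.i.d. SaS variables scaling like K1**(1/alpha)
    (the block estimator of Mohammadi et al., 2015).
    """
    x = np.asarray(samples, dtype=np.float64)
    k1 = x.size // n_blocks                      # block length K1
    x = x[: k1 * n_blocks]
    block_sums = x.reshape(n_blocks, k1).sum(axis=1)
    inv_alpha = (np.mean(np.log(np.abs(block_sums)))
                 - np.mean(np.log(np.abs(x)))) / np.log(k1)
    return float(np.clip(1.0 / inv_alpha, 0.0, 2.0))   # alpha lies in (0, 2]


if __name__ == "__main__":
    # Synthetic check: heavy-tailed SaS samples versus Gaussian samples.
    heavy = levy_stable.rvs(1.4, 0.0, size=200_000, random_state=0)
    gauss = np.random.default_rng(0).normal(size=200_000)
    print("true alpha 1.4 -> estimate", round(estimate_alpha(heavy), 2))
    print("Gaussian (alpha = 2) -> estimate", round(estimate_alpha(gauss), 2))
    # For SGN, `samples` would instead hold, for a single DNN parameter, the
    # minibatch gradient minus the full-batch gradient collected over many
    # minibatches, with the estimate computed parameter by parameter.
```

An estimate close to 2 indicates Gaussian-like noise, while values well below 2 indicate the heavy tails the paper reports; repeating the estimate per parameter probes the claim that different parameters carry distinct SGN characteristics.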
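The abstract also examines the mean escape time and trapping probability of the Lévy-driven dynamics near a local minimum. The toy Monte Carlo below contrasts Gaussian-driven and $S\alpha S$-driven Langevin-type updates leaving a one-dimensional quadratic basin; the step size, noise scale, basin width, and tail index are arbitrary illustrative choices, not a reproduction of the paper's experiments.

```python
# Toy escape-time experiment -- an illustrative sketch, not any paper's actual setup.
import numpy as np
from scipy.stats import levy_stable

ETA, SIGMA = 0.1, 0.05            # step size and noise scale (arbitrary)
BASIN, MAX_STEPS, N_RUNS = 1.0, 10_000, 50


def exit_stats(alpha):
    """Run  theta <- theta - ETA * theta + SIGMA * xi  (gradient of theta^2/2 is theta)
    with S-alpha-S noise xi, and record when |theta| first exceeds BASIN."""
    exit_steps = []
    for run in range(N_RUNS):
        xi = levy_stable.rvs(alpha, 0.0, size=MAX_STEPS, random_state=run)
        theta = 0.0
        for t in range(MAX_STEPS):
            theta = theta - ETA * theta + SIGMA * xi[t]
            if abs(theta) > BASIN:               # left the basin
                exit_steps.append(t + 1)
                break
    mean_exit = float(np.mean(exit_steps)) if exit_steps else float("inf")
    return len(exit_steps), mean_exit


if __name__ == "__main__":
    # alpha = 2 is the Gaussian case (up to a constant scale factor);
    # alpha = 1.5 is a heavy-tailed S-alpha-S case.
    for alpha in (2.0, 1.5):
        escaped, mean_exit = exit_stats(alpha)
        print(f"alpha={alpha}: escaped {escaped}/{N_RUNS} runs, "
              f"mean exit step {mean_exit:.1f}")
```

With these settings the Gaussian-driven runs essentially never leave the basin within the step budget, while the heavy-tailed runs typically exit within at most a few hundred steps, which is the qualitative gap that the escape-time analysis formalizes.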
Related papers
- Noise in the reverse process improves the approximation capabilities of
diffusion models [27.65800389807353]
In score-based generative modeling (SGMs), the state of the art in generative modeling, stochastic reverse processes are known to perform better than their deterministic counterparts.
This paper delves into the heart of this phenomenon, comparing neural ordinary differential equations (ODEs) and neural stochastic differential equations (SDEs) as reverse processes.
We analyze the ability of neural SDEs to approximate trajectories of the Fokker-Planck equation, revealing the advantages of stochasticity.
arXiv Detail & Related papers (2023-12-13T02:39:10Z) - Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and stochastic gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that SGN is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry.
arXiv Detail & Related papers (2021-02-13T21:28:09Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods to train deep neural networks.
We show that the covariance of the noise of SGD in the local region of the local minima is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - On the Promise of the Stochastic Generalized Gauss-Newton Method for
Training DNNs [37.96456928567548]
We study a stochastic generalized Gauss-Newton method (SGN) for training DNNs (a plain Gauss-Newton step is sketched after this list for orientation).
SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge.
We show that SGN substantially improves over SGD not only in terms of the number of iterations, but also in terms of runtime.
This is made possible by an efficient, easy-to-use and flexible implementation of SGN we propose in the Theano deep learning platform.
arXiv Detail & Related papers (2020-06-03T17:35:54Z) - Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum
under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with natural gradient methods and gradient clipping, bringing a new perspective on their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)
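For orientation on the stochastic generalized Gauss-Newton (SGN) entry above, here is a plain, non-stochastic Gauss-Newton iteration on a toy nonlinear least-squares problem. It is a generic textbook sketch, not the SGN implementation from that paper, which operates on minibatches and generalized Gauss-Newton matrices for DNN losses.

```python
# Plain damped Gauss-Newton on a toy problem -- a generic reference sketch only.
import numpy as np


def gauss_newton(x, y, theta, n_iters=30):
    """Fit y ~ a * exp(b * x): move along d = (J^T J)^{-1} J^T r and halve the
    step until the residual norm decreases (simple damping, no stochasticity)."""
    def residual(t):
        a, b = t
        return a * np.exp(b * x) - y

    for _ in range(n_iters):
        a, b = theta
        r = residual(theta)
        # Jacobian of the residual with respect to (a, b)
        J = np.stack([np.exp(b * x), a * x * np.exp(b * x)], axis=1)
        d = np.linalg.solve(J.T @ J, J.T @ r)
        step = 1.0
        while (np.linalg.norm(residual(theta - step * d))
               >= np.linalg.norm(r) and step > 1e-8):
            step *= 0.5                           # backtrack on overshoot
        theta = theta - step * d
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 200)
    y = 2.0 * np.exp(1.5 * x) + 0.01 * rng.normal(size=x.size)
    print("recovered (a, b):", gauss_newton(x, y, theta=np.array([1.0, 1.0])))
```

The step halving keeps the residual norm decreasing monotonically; the SGN paper's contribution is making iterations of this second-order type cheap enough for DNN training.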