Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics
- URL: http://arxiv.org/abs/2109.09833v1
- Date: Mon, 20 Sep 2021 20:39:14 GMT
- Title: Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics
- Authors: Yixin Wu and Rui Luo and Chen Zhang and Jun Wang and Yaodong Yang
- Abstract summary: We show that the gradient noise possesses finite variance, and therefore the Central Limit Theorem (CLT) applies.
We then demonstrate the existence of the steady-state distribution of gradient descent and approximate the distribution at a small learning rate.
- Score: 25.95229631113089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we characterize the noise of stochastic gradients and analyze
the noise-induced dynamics during training deep neural networks by
gradient-based optimizers. Specifically, we firstly show that the stochastic
gradient noise possesses finite variance, and therefore the classical Central
Limit Theorem (CLT) applies; this indicates that the gradient noise is
asymptotically Gaussian. This asymptotic result validates the widely accepted
assumption of Gaussian noise. We clarify that the recently observed heavy
tails in gradient noise may not be an intrinsic property, but rather a
consequence of insufficient mini-batch size; the gradient noise, a sum of a
limited number of i.i.d. random variables, has not yet reached the asymptotic
regime of the CLT and thus deviates from Gaussian. We quantitatively measure the goodness of
Gaussian approximation of the noise, which supports our conclusion. Secondly,
we analyze the noise-induced dynamics of stochastic gradient descent using the
Langevin equation, giving the momentum hyperparameter in the optimizer a
physical interpretation. We then proceed to demonstrate the existence of the
steady-state distribution of stochastic gradient descent and approximate the
distribution at a small learning rate.
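The abstract's central claim is that mini-batch gradient noise, as an average of finitely many finite-variance i.i.d. terms, approaches a Gaussian only once the batch size is large enough for the CLT to take hold. A minimal sketch of this effect, using a centered exponential as a stand-in for the per-sample noise (an illustrative assumption, not the distribution studied in the paper), tracks how the excess kurtosis of mini-batch averages shrinks toward the Gaussian value of zero as the batch size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample gradient noise modeled as a centered exponential: skewed, with
# a heavier tail than a Gaussian, but with finite variance so the CLT applies.
# (Illustrative choice only, not the distribution used in the paper.)
samples = rng.exponential(size=1_000_000) - 1.0

def excess_kurtosis(x):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

def batch_noise_kurtosis(samples, batch_size):
    """Excess kurtosis of mini-batch averages of the per-sample noise."""
    n = (len(samples) // batch_size) * batch_size
    means = samples[:n].reshape(-1, batch_size).mean(axis=1)
    return excess_kurtosis(means)

# Larger batches push the averaged noise toward the Gaussian regime of the CLT,
# mirroring the paper's point that heavy tails reflect insufficient batch size.
for batch_size in (4, 64, 1024):
    print(batch_size, round(batch_noise_kurtosis(samples, batch_size), 3))
```

For averages of n i.i.d. terms the excess kurtosis decays roughly as 1/n, so small batches retain a visibly non-Gaussian shape while large batches do not.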
Related papers
- Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks [0.6906005491572401]
We show that noise in stochastic gradient descent (SGD) with momentum smooths the objective function; the degree of smoothing is determined by the learning rate, the batch size, the momentum factor, and the upper bound of the norm.
We also provide experimental results supporting our assertion that model generalizability depends on the noise level.
arXiv Detail & Related papers (2024-02-04T02:48:28Z) - Doubly Stochastic Models: Learning with Unbiased Label Noises and
Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of gradient descent.
We find such implicit regularizer would favor some convergence points that could stabilize model outputs against perturbation of parameters.
Our work does not assume SGD to be an Ornstein-Uhlenbeck-like process, and achieves a more general result with convergence of the approximation proved.
arXiv Detail & Related papers (2023-04-01T14:09:07Z) - Asymptotic consistency of the WSINDy algorithm in the limit of continuum
data [0.0]
We study the consistency of the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy)
We provide a mathematically rigorous explanation for the observed robustness to noise of weak-form equation learning.
arXiv Detail & Related papers (2022-11-29T07:49:34Z) - A note on diffusion limits for stochastic gradient descent [0.0]
Much of the theory that attempts to clarify the role of noise in stochastic gradient algorithms approximates stochastic gradient descent by a differential equation with Gaussian noise.
We provide a novel theoretical justification for this practice that showcases how the noise arises naturally.
arXiv Detail & Related papers (2022-10-20T13:27:00Z) - High-Order Qubit Dephasing at Sweet Spots by Non-Gaussian Fluctuators:
Symmetry Breaking and Floquet Protection [55.41644538483948]
We study the qubit dephasing caused by the non-Gaussian fluctuators.
We predict a symmetry-breaking effect that is unique to the non-Gaussian noise.
arXiv Detail & Related papers (2022-06-06T18:02:38Z) - Computing the Variance of Shuffling Stochastic Gradient Algorithms via
Power Spectral Density Analysis [6.497816402045099]
Two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGDRR) and shuffle-once (SGD-SO).
We study the stationary variances of SGD, SGDRR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations.
arXiv Detail & Related papers (2022-06-01T17:08:04Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry.
arXiv Detail & Related papers (2021-02-13T21:28:09Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in gradient descent provides a crucial implicit regularization effect for training over-parameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z) - Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity bounds that outperform state-of-the-art results in this case.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without a light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z)
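The clipped-SSTM paper above relies on gradient clipping to tame heavy-tailed noise. As a minimal sketch of the clipping operation itself (a toy illustration, not the accelerated method from the paper), the following rescales any gradient whose L2 norm exceeds a threshold, and applies it in a plain SGD loop on a quadratic loss with Cauchy-distributed (infinite-variance) noise:

```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Rescale grad so that its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Toy SGD-with-clipping run on the quadratic loss f(w) = 0.5 * ||w||^2,
# whose true gradient is w, corrupted here by heavy-tailed Cauchy noise.
rng = np.random.default_rng(0)
w = np.ones(3)
lr, max_norm = 0.1, 1.0
for _ in range(200):
    noisy_grad = w + 0.1 * rng.standard_cauchy(3)
    w -= lr * clip_gradient(noisy_grad, max_norm)
```

Because every update is bounded by `lr * max_norm`, occasional huge noise spikes cannot throw the iterate arbitrarily far, which is the intuition behind the high-probability bounds these clipping results establish.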
This list is automatically generated from the titles and abstracts of the papers in this site.