Dynamic of Stochastic Gradient Descent with State-Dependent Noise
- URL: http://arxiv.org/abs/2006.13719v3
- Date: Mon, 12 Oct 2020 11:40:36 GMT
- Title: Dynamic of Stochastic Gradient Descent with State-Dependent Noise
- Authors: Qi Meng, Shiqi Gong, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
- Abstract summary: Stochastic gradient descent (SGD) and its variants are mainstream methods to train deep neural networks.
We show that, in the local region around local minima, the covariance of the SGD noise is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
- Score: 84.64013284862733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) and its variants are mainstream methods to
train deep neural networks. Since neural networks are non-convex, a growing
body of work studies the dynamic behavior of SGD and its impact on
generalization, especially the efficiency of escaping from local minima.
However, these works adopt the over-simplified assumption that the covariance
of the noise in SGD is (or can be upper bounded by) a constant, although it is
actually state-dependent. In this work, we conduct a formal study of the
dynamic behavior of SGD with state-dependent noise. Specifically, we show that,
in the local region around local minima, the covariance of the SGD noise is a
quadratic function of the state. We therefore propose a novel power-law dynamic
with state-dependent diffusion to approximate the dynamic of SGD. We prove that
the power-law dynamic can escape from sharp minima exponentially faster than
from flat minima, whereas previous dynamics can only escape from sharp minima
polynomially faster. Our experiments verify our theoretical results. Inspired by our
theory, we propose to add additional state-dependent noise into (large-batch)
SGD to further improve its generalization ability. Experiments verify that our
method is effective.
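The central modeling step in the abstract can be written compactly. The following is a one-dimensional sketch assembled from the abstract alone, assuming the simplest quadratic form; the coefficients sigma_0 and sigma_1, the learning rate eta, and the multivariate generalization are placeholders rather than the paper's exact notation.

```latex
% Near a local minimum \theta^*, the abstract states that the SGD noise
% covariance is a quadratic function of the state; the simplest 1-D form is
\[
  \sigma(\theta) \;=\; \sigma_0 \;+\; \sigma_1\,(\theta - \theta^*)^2 .
\]
% The power-law dynamic then approximates SGD by a diffusion whose noise
% scale depends on the current state:
\[
  d\theta_t \;=\; -\nabla L(\theta_t)\,dt \;+\; \sqrt{\eta\,\sigma(\theta_t)}\;dB_t .
\]
```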
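Inspired by the last sentence of the abstract, below is a minimal NumPy sketch of injecting extra state-dependent noise into a (large-batch) SGD step. The quadratic variance c0 + c1 * ||theta - theta_ref||^2, the reference point theta_ref, and all constants are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def sgd_step_with_state_dependent_noise(theta, grad, lr=0.1,
                                        theta_ref=None, c0=1e-4, c1=1e-2,
                                        rng=np.random.default_rng(0)):
    """One (large-batch) SGD step with extra state-dependent Gaussian noise.

    The injected noise variance grows quadratically with the distance from a
    reference point theta_ref, mimicking the quadratic state dependence
    described in the abstract; c0, c1 and theta_ref are illustrative knobs.
    """
    if theta_ref is None:
        theta_ref = np.zeros_like(theta)
    # Quadratic state-dependent variance: c0 + c1 * ||theta - theta_ref||^2
    var = c0 + c1 * np.sum((theta - theta_ref) ** 2)
    noise = rng.normal(scale=np.sqrt(var), size=theta.shape)
    # Perturb the gradient before the usual SGD update.
    return theta - lr * (grad + noise)

# Toy usage on the quadratic loss L(theta) = 0.5 * ||theta||^2 (grad = theta).
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = sgd_step_with_state_dependent_noise(theta, grad=theta)
print(theta)  # ends up near the minimum, with a small noisy residual
```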
Related papers
- Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning
and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - How Can Increased Randomness in Stochastic Gradient Descent Improve
Generalization? [0.0]
We study the role of the SGD learning rate and batch size in generalization.
We show that increasing SGD temperature encourages the selection of local minima with lower curvature.
arXiv Detail & Related papers (2021-08-21T13:18:49Z) - Noisy Truncated SGD: Optimization and Generalization [27.33458360279836]
Recent empirical work on SGD has shown that most gradient components over epochs are quite small.
Inspired by such a study, we rigorously study the properties of noisy truncated SGD (NT-SGD).
We prove that NT-SGD can provably escape from saddle points and requires less noise compared to previous related work.
arXiv Detail & Related papers (2021-02-26T22:39:41Z) - Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry.
arXiv Detail & Related papers (2021-02-13T21:28:09Z) - Noise and Fluctuation of Finite Learning Rate Stochastic Gradient
Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM
in Deep Learning [165.47118387176607]
It is not clear yet why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed.
Specifically, we observe the heavy tails of gradient noise in these algorithms.
arXiv Detail & Related papers (2020-10-12T12:00:26Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z) - Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum
under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with natural gradient methods and gradient clipping in their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)