Noise and Fluctuation of Finite Learning Rate Stochastic Gradient
Descent
- URL: http://arxiv.org/abs/2012.03636v3
- Date: Fri, 12 Feb 2021 08:43:27 GMT
- Title: Noise and Fluctuation of Finite Learning Rate Stochastic Gradient
Descent
- Authors: Kangqiao Liu, Liu Ziyin, Masahito Ueda
- Abstract summary: Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
- Score: 3.0079490585515343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the vanishing learning rate regime, stochastic gradient descent (SGD) is
now relatively well understood. In this work, we propose to study the basic
properties of SGD and its variants in the non-vanishing learning rate regime.
The focus is on deriving exactly solvable results and discussing their
implications. The main contributions of this work are to derive the stationary
distribution for discrete-time SGD in a quadratic loss function with and
without momentum; in particular, one implication of our result is that the
fluctuation caused by discrete-time dynamics takes a distorted shape and is
dramatically larger than a continuous-time theory could predict. Examples of
applications of the proposed theory considered in this work include the
approximation error of variants of SGD, the effect of minibatch noise, the
optimal Bayesian inference, the escape rate from a sharp minimum, and the
stationary distribution of a few second-order methods including damped Newton's
method and natural gradient descent.
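As a rough, self-contained illustration of the abstract's central claim, the sketch below simulates discrete-time SGD on a one-dimensional quadratic loss with additive Gaussian gradient noise (an assumed simplification of the paper's minibatch-noise setting; the curvature k, noise scale sigma, and learning rate lr are arbitrary choices) and compares the empirical stationary variance with the continuous-time (Ornstein-Uhlenbeck) estimate and the exact finite-learning-rate formula.

```python
import numpy as np

# Minimal sketch (not the paper's derivation): discrete-time SGD on the
# 1D quadratic loss L(theta) = 0.5 * k * theta**2 with additive Gaussian
# gradient noise of variance sigma**2. Compare the empirical stationary
# variance with (a) the vanishing-learning-rate (Ornstein-Uhlenbeck)
# prediction lr * sigma**2 / (2 * k) and (b) the exact discrete-time
# formula lr * sigma**2 / (k * (2 - lr * k)).

rng = np.random.default_rng(0)

k = 1.0        # curvature of the quadratic loss (assumed)
sigma = 1.0    # std of the additive gradient noise (assumed)
lr = 1.5       # finite learning rate; stable as long as lr * k < 2

theta = 0.0
burn_in, n_steps = 10_000, 200_000
samples = []
for t in range(burn_in + n_steps):
    grad = k * theta + sigma * rng.standard_normal()  # noisy gradient
    theta -= lr * grad                                # SGD update
    if t >= burn_in:
        samples.append(theta)

empirical = np.var(samples)
continuous_time = lr * sigma**2 / (2 * k)             # vanishing-lr (OU) estimate
discrete_time = lr * sigma**2 / (k * (2 - lr * k))    # finite-lr stationary variance

print(f"empirical variance        : {empirical:.3f}")
print(f"continuous-time prediction: {continuous_time:.3f}")
print(f"discrete-time prediction  : {discrete_time:.3f}")
# With lr * k = 1.5 the discrete-time variance is 4x the continuous-time
# estimate, illustrating how a continuous-time theory can dramatically
# underestimate the fluctuation at finite learning rates.
```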
Related papers
- Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks [0.6906005491572401]
We show that noise in stochastic gradient descent (SGD) with momentum smooths the objective function, to a degree determined by the learning rate, the batch size, the momentum factor, and the upper bound of the norm.
We also provide experimental results supporting our assertion that model generalizability depends on the noise level.
arXiv Detail & Related papers (2024-02-04T02:48:28Z) - Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning
and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z) - The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings in the absence of gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z) - Doubly Stochastic Models: Learning with Unbiased Label Noises and
Inference Stability [85.1044381834036]
We investigate the implicit regularization effect of label noise under the mini-batch sampling setting of stochastic gradient descent.
We find that this implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not assume that SGD behaves like an Ornstein-Uhlenbeck process and achieves a more general result, with the convergence of the approximation proved.
arXiv Detail & Related papers (2023-04-01T14:09:07Z) - Computing the Variance of Shuffling Stochastic Gradient Algorithms via
Power Spectral Density Analysis [6.497816402045099]
Two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO).
We study the stationary variances of SGD, SGD-RR, and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations; a toy sketch of these three sampling schemes is given after this list.
arXiv Detail & Related papers (2022-06-01T17:08:04Z) - The effective noise of Stochastic Gradient Descent [9.645196221785694]
Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning.
We characterize the parameters of SGD and a recently-introduced variant, persistent SGD, in a neural network model.
We find that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
arXiv Detail & Related papers (2021-12-20T20:46:19Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are the mainstream methods used to train deep neural networks.
We show that the covariance of the SGD noise in the neighborhood of local minima is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum
under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with natural gradient methods in their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)
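The sketch referenced in the shuffling-variance entry above: a toy comparison of with-replacement SGD, random reshuffling (SGD-RR), and shuffle-once (SGD-SO) on a one-dimensional least-squares problem. The data size, noise level, and learning rate are assumptions made for illustration, and a single run is only indicative of the leading-order ordering derived in that paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 32
x = rng.standard_normal(n)
y = 2.0 * x + 0.5 * rng.standard_normal(n)   # 1D least-squares data (assumed toy setup)
lr = 0.05                                    # constant step size
epochs, burn_in = 4000, 1000                 # epochs to run / to discard


def stationary_variance(scheme: str) -> float:
    """Empirical variance of the iterates after burn-in for one sampling scheme."""
    w = 0.0
    fixed_perm = rng.permutation(n)          # used only by shuffle-once
    iterates = []
    for epoch in range(epochs):
        if scheme == "sgd":                  # i.i.d. with-replacement sampling
            order = rng.integers(0, n, size=n)
        elif scheme == "rr":                 # fresh random permutation each epoch
            order = rng.permutation(n)
        else:                                # "so": one fixed permutation, reused
            order = fixed_perm
        for i in order:
            grad = (w * x[i] - y[i]) * x[i]  # per-sample least-squares gradient
            w -= lr * grad
            if epoch >= burn_in:
                iterates.append(w)
    return float(np.var(iterates))


# The paper's result concerns the leading terms of these variances
# (decreasing from SGD to SGD-RR to SGD-SO); a single run, especially for
# shuffle-once with one particular permutation, is only indicative.
for scheme in ("sgd", "rr", "so"):
    print(f"{scheme:>3}: {stationary_variance(scheme):.2e}")
```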
This list is automatically generated from the titles and abstracts of the papers in this site.