Understanding Long Range Memory Effects in Deep Neural Networks
- URL: http://arxiv.org/abs/2105.02062v2
- Date: Thu, 6 May 2021 02:49:44 GMT
- Title: Understanding Long Range Memory Effects in Deep Neural Networks
- Authors: Chengli Tan, Jiangshe Zhang, and Junmin Liu
- Abstract summary: Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that SGN is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
- Score: 10.616643031188248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: \textit{Stochastic gradient descent} (SGD) is of fundamental importance in
deep learning. Despite its simplicity, elucidating its efficacy remains
challenging. Conventionally, the success of SGD is attributed to the
\textit{stochastic gradient noise} (SGN) incurred in the training process.
Based on this general consensus, SGD is frequently treated and analyzed as the
Euler-Maruyama discretization of a \textit{stochastic differential equation}
(SDE) driven by either Brownian or L\'evy stable motion. In this study, we
argue that SGN is neither Gaussian nor stable. Instead, inspired by the
long-time correlation emerging in SGN series, we propose that SGD can be viewed
as a discretization of an SDE driven by \textit{fractional Brownian motion}
(FBM). Accordingly, the different convergence behavior of SGD dynamics is well
grounded. Moreover, the first passage time of an SDE driven by FBM is
approximately derived. This indicates a lower escaping rate for a larger Hurst
parameter, and thus SGD stays longer in flat minima. This happens to coincide
with the well-known phenomenon that SGD favors flat minima that generalize
well. Four groups of experiments are conducted to validate our conjecture, and
it is demonstrated that long-range memory effects persist across various model
architectures, datasets, and training strategies. Our study opens up a new
perspective and may contribute to a better understanding of SGD.
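The FBM view can be made concrete with a small simulation. The following is a minimal sketch (not the authors' implementation): it draws exact fractional-Gaussian-noise increments via a Cholesky factorization and runs an Euler-type discretization of dX_t = -f'(X_t) dt + \sigma dB^H_t on an illustrative double-well loss. The objective, step size, and noise scale are assumptions chosen only to show the mechanics; setting H = 0.5 recovers the ordinary Brownian-driven SDE, while H > 0.5 introduces the positively correlated (long-memory) increments the paper argues are present in SGN.

```python
# Minimal sketch (not the authors' implementation): an Euler-type discretization
# of an SDE driven by fractional Brownian motion, dX_t = -f'(X_t) dt + sigma dB^H_t,
# on an illustrative double-well loss. All constants are assumed values.
import numpy as np

def fgn(n, hurst, dt, rng):
    """Draw n increments of FBM (fractional Gaussian noise) exactly, via the
    Cholesky factor of their autocovariance matrix (fine for small n)."""
    k = np.arange(n)
    gamma = 0.5 * dt ** (2 * hurst) * (
        np.abs(k + 1) ** (2 * hurst)
        - 2 * np.abs(k) ** (2 * hurst)
        + np.abs(k - 1) ** (2 * hurst)
    )
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))
    return L @ rng.standard_normal(n)

def grad(x):
    # gradient of the double-well loss f(x) = (x^2 - 1)^2, minima at x = -1 and x = +1
    return 4.0 * x * (x ** 2 - 1.0)

def simulate(hurst, n_steps=2000, dt=1e-2, sigma=1.5, x0=-1.0, seed=0):
    rng = np.random.default_rng(seed)
    noise = fgn(n_steps, hurst, dt, rng)
    x, path = x0, [x0]
    for i in range(n_steps):
        x = x - grad(x) * dt + sigma * noise[i]   # Euler-type update
        path.append(x)
    return np.array(path)

if __name__ == "__main__":
    for H in (0.5, 0.7):   # H = 0.5 recovers ordinary Brownian motion
        path = simulate(H)
        frac = float(np.mean(path > 0.0))
        print(f"H={H}: fraction of time outside the starting (left) well = {frac:.3f}")
```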
Related papers
- The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization [4.7256945641654164]
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training.
Recent studies on SGD for canonical quadratic optimization or linear regression show that it attains good generalization in suitable high-dimensional settings.
This paper investigates SGD with two components that are essential in practice: an exponentially decaying step size schedule and momentum.
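A small illustrative sketch of these two components (assumed constants, not taken from the paper): SGD on a least-squares problem with a per-iteration step size eta_t = eta_0 * r^t and a heavy-ball momentum buffer.

```python
# Illustrative sketch: minibatch SGD on least squares with an exponentially
# decaying step size and heavy-ball momentum. All constants are assumed values.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 50
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
v = np.zeros(d)                                      # momentum buffer
eta0, decay, beta, batch = 0.1, 0.999, 0.9, 32

for t in range(5000):
    idx = rng.integers(0, n, size=batch)             # minibatch sample
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch     # stochastic gradient
    eta = eta0 * decay ** t                          # exponential step-size decay
    v = beta * v + g                                 # heavy-ball momentum
    w = w - eta * v

print("excess risk:", np.mean((X @ w - X @ w_star) ** 2))
```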
arXiv Detail & Related papers (2024-09-15T14:20:03Z) - Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper characterizes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
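A toy numerical sketch of the staleness effect (an assumed setup, not the paper's SDDE framework): each applied gradient is computed on parameters that are several updates old, with the delay taken as a crude proxy for the number of workers; the toy isolates the effect of stale gradients and ignores throughput gains.

```python
# Toy sketch (assumed setup): delayed-gradient SGD on a quadratic loss, where
# staleness is taken proportional to the number of workers.
import numpy as np

def stale_sgd(num_workers, steps=3000, eta=0.02, d=20, seed=0):
    rng = np.random.default_rng(seed)
    A = np.diag(np.linspace(0.1, 1.0, d))              # quadratic loss 0.5 * w^T A w
    w = np.ones(d)
    history = [w.copy()]
    staleness = num_workers - 1                        # crude proxy for worker count
    for t in range(steps):
        w_old = history[max(0, t - staleness)]         # parameters the gradient was computed on
        g = A @ w_old + 0.01 * rng.standard_normal(d)  # stale stochastic gradient
        w = w - eta * g
        history.append(w.copy())
    return 0.5 * w @ A @ w                             # final loss

for k in (1, 8, 32, 64):
    print(f"workers={k:3d}  final loss={stale_sgd(k):.2e}")
```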
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - Revisiting the Noise Model of Stochastic Gradient Descent [5.482532589225552]
Stochastic gradient noise (SGN) is a significant factor in the success of stochastic gradient descent.
We show that SGN is heavy-tailed and better depicted by the $S\alpha S$ (symmetric $\alpha$-stable) distribution.
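As a rough illustration of what "heavy-tailed" means diagnostically (not the paper's methodology), the sketch below compares excess kurtosis and a simple Hill tail-index estimate on Gaussian noise versus a stand-in symmetric alpha-stable sample; in practice the series would come from logging per-minibatch gradient noise during training.

```python
# Illustrative tail-heaviness diagnostics on stand-in noise samples.
import numpy as np
from scipy.stats import levy_stable, kurtosis

def hill_tail_index(x, k=200):
    """Hill estimator of the tail exponent from the k largest |x| values."""
    a = np.sort(np.abs(x))[::-1][: k + 1]
    return 1.0 / np.mean(np.log(a[:k] / a[k]))

rng = np.random.default_rng(0)
gaussian_noise = rng.standard_normal(20000)
stable_noise = levy_stable.rvs(alpha=1.7, beta=0.0, size=20000, random_state=rng)

for name, x in [("gaussian", gaussian_noise), ("alpha-stable (alpha=1.7)", stable_noise)]:
    print(f"{name:>24}: excess kurtosis={kurtosis(x):9.2f}  "
          f"Hill tail index={hill_tail_index(x):.2f}")
```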
arXiv Detail & Related papers (2023-03-05T18:55:12Z) - From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
Our main contribution is providing general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - SGD with a Constant Large Learning Rate Can Converge to Local Maxima [4.014524824655106]
We construct worst-case optimization problems illustrating that stochastic gradient descent (SGD) can exhibit strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that SGD converges to local maxima.
Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes.
arXiv Detail & Related papers (2021-07-25T10:12:18Z) - Noisy Truncated SGD: Optimization and Generalization [27.33458360279836]
Recent empirical work on SGD has shown that most gradient components over epochs are quite small.
Inspired by such a study, we rigorously study the properties of noisy truncated SGD (NT-SGD).
We prove that NT-SGD can escape from saddle points and requires less noise compared to previous related work.
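The sketch below shows one plausible reading of the name NT-SGD, a generic "truncate small gradient components, then inject noise" update; the thresholding rule, noise scale, and test objective are assumptions here and are not taken from the paper.

```python
# Rough sketch of a "truncate small gradient components, then inject noise" update.
import numpy as np

def nt_sgd_step(w, grad, eta=0.1, keep_frac=0.3, noise_std=0.01, rng=None):
    rng = rng or np.random.default_rng()
    g = grad(w)
    # keep only the largest-magnitude fraction of components, zero the rest
    k = max(1, int(keep_frac * g.size))
    threshold = np.partition(np.abs(g), -k)[-k]
    g_trunc = np.where(np.abs(g) >= threshold, g, 0.0)
    # perturb the truncated gradient with isotropic Gaussian noise
    return w - eta * (g_trunc + noise_std * rng.standard_normal(g.shape))

# usage on an assumed quadratic objective f(w) = 0.5 * ||w||^2, gradient w
rng = np.random.default_rng(0)
w = rng.standard_normal(100)
for _ in range(200):
    w = nt_sgd_step(w, grad=lambda v: v, rng=rng)
print("final ||w|| =", np.linalg.norm(w))
```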
arXiv Detail & Related papers (2021-02-26T22:39:41Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks.
We show that, in the local region around a local minimum, the covariance of the SGD noise is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
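A minimal sketch of the two schemes being contrasted (an assumed setup, not the paper's analysis): standard heavy-ball SGDM versus an "early momentum" variant that applies momentum only for the first few epochs and then falls back to plain SGD, on a least-squares problem.

```python
# Illustrative comparison of SGDM (momentum throughout) and an early-momentum
# variant (momentum only for the first few epochs). All constants are assumed.
import numpy as np

def run(early_momentum_epochs=None, epochs=30, beta=0.9, eta=0.05, batch=16, seed=0):
    rng = np.random.default_rng(seed)
    n, d = 1024, 20
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    w, v = np.zeros(d), np.zeros(d)
    for epoch in range(epochs):
        # heavy-ball momentum, possibly switched off after the early phase
        use_momentum = early_momentum_epochs is None or epoch < early_momentum_epochs
        for idx in rng.permutation(n).reshape(-1, batch):
            g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
            v = (beta * v if use_momentum else 0.0) + g
            w = w - eta * v
    return np.mean((X @ w - y) ** 2)

print("SGDM  (momentum throughout):   ", run())
print("SGDEM (momentum, first 5 epochs):", run(early_momentum_epochs=5))
```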
arXiv Detail & Related papers (2018-09-12T17:02:08Z)