Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
- URL: http://arxiv.org/abs/2102.04297v1
- Date: Mon, 8 Feb 2021 16:03:49 GMT
- Title: Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
- Authors: Xingyu Wang, Sewoong Oh, Chang-Han Rhee
- Abstract summary: Evidence of heavy-tailed gradient noise was reported in many deep learning tasks.
We show that truncated SGD can effectively eliminate sharp local minima entirely from its training trajectory.
- Score: 39.27123042800951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The empirical success of deep learning is often attributed to SGD's
mysterious ability to avoid sharp local minima in the loss landscape, which is
well known to lead to poor generalization. Recently, empirical evidence of
heavy-tailed gradient noise was reported in many deep learning tasks; under the
presence of such heavy-tailed noise, it can be shown that SGD can escape sharp
local minima, providing a partial solution to the mystery. In this work, we
analyze a popular variant of SGD where gradients are truncated above a fixed
threshold. We show that it achieves a stronger notion of avoiding sharp minima;
it can effectively eliminate sharp local minima entirely from its training
trajectory. We characterize the dynamics of truncated SGD driven by
heavy-tailed noise. First, we show that the truncation threshold and width of
the attraction field dictate the order of the first exit time from the
associated local minimum. Moreover, when the objective function satisfies
appropriate structural conditions, we prove that, as the learning rate decreases,
the dynamics of heavy-tailed SGD closely resemble those of a special
continuous-time Markov chain which never visits any sharp minima. We verify our
theoretical results with numerical experiments and discuss the implications on
the generalizability of SGD in deep learning.
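To make the mechanism concrete, here is a minimal, self-contained sketch (not the paper's code) of truncated SGD driven by heavy-tailed noise on a one-dimensional double-well loss. The landscape, the Pareto-tailed noise model, and all parameter values are illustrative assumptions; the point is that each update is clipped at a fixed threshold b, so escaping an attraction field of width w takes on the order of w/b large noise jumps, which makes narrow (sharp) wells much easier to leave than wide (flat) ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of a quartic double-well loss: minima at x = -0.5 (narrow
    # attraction field, width 0.8) and x = 2.0 (wide field, width 1.7),
    # separated by a local maximum at x = 0.3. Illustrative, not the paper's.
    return (x + 0.5) * (x - 0.3) * (x - 2.0)

def heavy_tailed_noise(alpha=1.5):
    # Symmetric Pareto-tailed noise with P(|Z| > z) ~ z^(-alpha): a simple
    # stand-in for the heavy-tailed gradient noise reported empirically.
    return rng.choice((-1.0, 1.0)) * rng.pareto(alpha)

def truncated_step(x, eta=0.01, b=0.5):
    # One truncated SGD update: the displacement eta * (gradient + noise)
    # is clipped at magnitude b, so a single step moves at most b.
    delta = -eta * (grad_f(x) + heavy_tailed_noise())
    return x + float(np.clip(delta, -b, b))

x, in_wide, T = -0.5, 0, 200_000   # start at the sharp (narrow) minimum
for _ in range(T):
    x = truncated_step(x)
    in_wide += x > 0.3
print(f"fraction of steps spent in the wide (flat) well: {in_wide / T:.2f}")
```

With these (arbitrary) settings, one should see the iterate leave the narrow well early in the run and then concentrate around the flat minimum, qualitatively matching the elimination result stated above.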
Related papers
- Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and Control [7.665296591586615]
Stochastic gradient descent (SGD) and its variants enable modern artificial intelligence. It is widely believed that SGD has a curious ability to avoid sharp local minima in the loss landscape. We reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noise during the training phase, SGD can almost completely avoid sharp minima.
arXiv Detail & Related papers (2025-10-23T18:01:29Z)
- Simplicity Bias via Global Convergence of Sharpness Minimization [43.658859631741024]
We show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks.
We also find a novel property of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss.
arXiv Detail & Related papers (2024-10-21T18:10:37Z)
- Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z)
- Low-Precision Stochastic Gradient Langevin Dynamics [70.69923368584588]
We provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance.
We develop a new quantization function for SGLD that preserves the variance in each update step.
We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
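As a point of reference, below is a hedged sketch of low-precision SGLD with plain stochastic rounding, which is unbiased but inflates the per-step variance; the paper's contribution is a quantizer that corrects for that excess variance, and its precise form may differ from this baseline. The grid spacing, step size, and function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, delta=2.0**-6):
    # Unbiased stochastic rounding onto a fixed-point grid of spacing delta:
    # E[Q(x)] = x, but the rounding itself adds variance on top of the
    # injected Gaussian noise -- the excess that a variance-preserving
    # quantizer is designed to remove.
    scaled = x / delta
    lo = np.floor(scaled)
    return delta * (lo + (rng.random(x.shape) < scaled - lo))

def lowprec_sgld_step(theta, grad, eta=1e-3):
    # One SGLD update (gradient step plus sqrt(2 * eta) Gaussian noise),
    # with the new iterate quantized back to the low-precision grid.
    proposal = (theta - eta * grad(theta)
                + np.sqrt(2.0 * eta) * rng.standard_normal(theta.shape))
    return stochastic_round(proposal)
```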
arXiv Detail & Related papers (2022-06-20T17:25:41Z)
- SGD with a Constant Large Learning Rate Can Converge to Local Maxima [4.014524824655106]
We construct worst-case optimization problems illustrating that stochastic gradient descent (SGD) can exhibit strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that SGD converges to local maxima.
Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes.
arXiv Detail & Related papers (2021-07-25T10:12:18Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
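For orientation, here is a hedged rendering of the objective this summary refers to; the identification of $R$ with a trace-of-Hessian term is the standard one in the flat-minima literature, but the paper's exact regularizer and the constant $\lambda$ may differ:

```latex
\min_{\theta}\; L(\theta) + \lambda R(\theta),
\qquad
R(\theta) \approx \operatorname{tr}\!\left(\nabla^{2} L(\theta)\right)
```

so directions in which the Hessian has large eigenvalues are penalized most, matching the regularization effect described above.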
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- Noisy Truncated SGD: Optimization and Generalization [27.33458360279836]
Recent empirical work on SGD has shown that most gradient components over epochs are quite small.
Inspired by such a study, we rigorously study the properties of noisy truncated SGD (NT-SGD).
We prove that NT-SGD can provably escape from saddle points and requires less noise compared to previous related work.
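A hedged sketch of the kind of update the summary describes: zero out the small gradient coordinates and inject isotropic Gaussian noise, which supplies the perturbation used to escape saddle points. The truncation rule, threshold, and noise scale here are illustrative assumptions and may not match the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def nt_sgd_step(theta, grad, eta=0.05, tau=1e-3, sigma=1e-2):
    # Truncate: keep only gradient components with magnitude >= tau,
    # motivated by the observation that most components are quite small.
    g = grad(theta)
    g = np.where(np.abs(g) >= tau, g, 0.0)
    # Inject Gaussian noise so the iterate can still escape saddle points.
    return theta - eta * g + sigma * rng.standard_normal(theta.shape)
```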
arXiv Detail & Related papers (2021-02-26T22:39:41Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks.
We show that the covariance of the SGD noise in the neighborhood of a local minimum is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD.
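Concretely, a hedged continuous-time rendering of that claim: near a local minimum $\theta^{*}$, SGD is approximated by a diffusion whose noise covariance grows quadratically with the distance from $\theta^{*}$ (the paper's power-law dynamic is a specific instance; the exact form below is illustrative):

```latex
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\Sigma(\theta_t)}\, dW_t,
\qquad
\Sigma(\theta) = \sigma_0^{2} + c\,\lVert \theta - \theta^{*} \rVert^{2}
```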
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)