Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
- URL: http://arxiv.org/abs/2102.04297v1
- Date: Mon, 8 Feb 2021 16:03:49 GMT
- Title: Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
- Authors: Xingyu Wang, Sewoong Oh, Chang-Han Rhee
- Abstract summary: Evidence of heavy-tailed gradient noise was reported in many deep learning tasks.
We show that truncated SGD can effectively eliminate sharp local minima entirely from its training trajectory.
- Score: 39.27123042800951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The empirical success of deep learning is often attributed to SGD's
mysterious ability to avoid sharp local minima in the loss landscape, which is
well known to lead to poor generalization. Recently, empirical evidence of
heavy-tailed gradient noise was reported in many deep learning tasks; under the
presence of such heavy-tailed noise, it can be shown that SGD can escape sharp
local minima, providing a partial solution to the mystery. In this work, we
analyze a popular variant of SGD where gradients are truncated above a fixed
threshold. We show that it achieves a stronger notion of avoiding sharp minima;
it can effectively eliminate sharp local minima entirely from its training
trajectory. We characterize the dynamics of truncated SGD driven by
heavy-tailed noise. First, we show that the truncation threshold and width of
the attraction field dictate the order of the first exit time from the
associated local minimum. Moreover, when the objective function satisfies
appropriate structural conditions, we prove that, as the learning rate decreases,
the dynamics of heavy-tailed SGD closely resemble those of a special
continuous-time Markov chain which never visits any sharp minima. We verify our
theoretical results with numerical experiments and discuss the implications on
the generalizability of SGD in deep learning.
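To make the mechanism concrete, here is a minimal, self-contained sketch (not the paper's code) of truncated SGD driven by heavy-tailed noise on a one-dimensional double-well loss. The landscape, the Pareto-tailed noise model, and all parameter values are illustrative assumptions; the point is that each update is clipped at a fixed threshold b, so escaping an attraction field of width w takes on the order of w/b large noise jumps, which makes narrow (sharp) wells much easier to leave than wide (flat) ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of a quartic double-well loss: minima at x = -0.5 (narrow
    # attraction field, width 0.8) and x = 2.0 (wide field, width 1.7),
    # separated by a local maximum at x = 0.3. Illustrative, not the paper's.
    return (x + 0.5) * (x - 0.3) * (x - 2.0)

def heavy_tailed_noise(alpha=1.5):
    # Symmetric Pareto-tailed noise with P(|Z| > z) ~ z^(-alpha): a simple
    # stand-in for the heavy-tailed gradient noise reported empirically.
    return rng.choice((-1.0, 1.0)) * rng.pareto(alpha)

def truncated_step(x, eta=0.01, b=0.5):
    # One truncated SGD update: the displacement eta * (gradient + noise)
    # is clipped at magnitude b, so a single step moves at most b.
    delta = -eta * (grad_f(x) + heavy_tailed_noise())
    return x + float(np.clip(delta, -b, b))

x, in_wide, T = -0.5, 0, 200_000   # start at the sharp (narrow) minimum
for _ in range(T):
    x = truncated_step(x)
    in_wide += x > 0.3
print(f"fraction of steps spent in the wide (flat) well: {in_wide / T:.2f}")
```

With these (arbitrary) settings, one should see the iterate leave the narrow well early in the run and then concentrate around the flat minimum, qualitatively matching the elimination result stated above.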
Related papers
- Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and Control [7.665296591586615]
Stochastic gradient descent (SGD) and its variants enable modern artificial intelligence. It is widely believed that SGD has a curious ability to avoid sharp local minima in the loss landscape. We reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noise during the training phase, SGD can almost completely avoid sharp minima.
arXiv Detail & Related papers (2025-10-23T18:01:29Z)
- Simplicity Bias via Global Convergence of Sharpness Minimization [43.658859631741024]
We show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks.
We also find a novel property of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss.
arXiv Detail & Related papers (2024-10-21T18:10:37Z)
- Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z)
- Low-Precision Stochastic Gradient Langevin Dynamics [70.69923368584588]
We provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance.
We develop a new quantization function for SGLD that preserves the variance in each update step.
We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
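As a point of reference, below is a hedged sketch of low-precision SGLD with plain stochastic rounding, which is unbiased but inflates the per-step variance; the paper's contribution is a quantizer that corrects for that excess variance, and its precise form may differ from this baseline. The grid spacing, step size, and function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, delta=2.0**-6):
    # Unbiased stochastic rounding onto a fixed-point grid of spacing delta:
    # E[Q(x)] = x, but the rounding itself adds variance on top of the
    # injected Gaussian noise -- the excess that a variance-preserving
    # quantizer is designed to remove.
    scaled = x / delta
    lo = np.floor(scaled)
    return delta * (lo + (rng.random(x.shape) < scaled - lo))

def lowprec_sgld_step(theta, grad, eta=1e-3):
    # One SGLD update (gradient step plus sqrt(2 * eta) Gaussian noise),
    # with the new iterate quantized back to the low-precision grid.
    proposal = (theta - eta * grad(theta)
                + np.sqrt(2.0 * eta) * rng.standard_normal(theta.shape))
    return stochastic_round(proposal)
```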
arXiv Detail & Related papers (2022-06-20T17:25:41Z)
- SGD with a Constant Large Learning Rate Can Converge to Local Maxima [4.014524824655106]
We construct worst-case optimization problems illustrating that stochastic gradient descent (SGD) can exhibit strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that SGD converges to local maxima.
Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes.
arXiv Detail & Related papers (2021-07-25T10:12:18Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
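For orientation, here is a hedged rendering of the objective this summary refers to; the identification of $R$ with a trace-of-Hessian term is the standard one in the flat-minima literature, but the paper's exact regularizer and the constant $\lambda$ may differ:

```latex
\min_{\theta}\; L(\theta) + \lambda R(\theta),
\qquad
R(\theta) \approx \operatorname{tr}\!\left(\nabla^{2} L(\theta)\right)
```

so directions in which the Hessian has large eigenvalues are penalized most, matching the regularization effect described above.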
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- Noisy Truncated SGD: Optimization and Generalization [27.33458360279836]
Recent empirical work on SGD has shown that most gradient components over epochs are quite small.
Inspired by such a study, we rigorously study the properties of noisy truncated SGD (NT-SGD).
We prove that NT-SGD can provably escape from saddle points and requires less noise compared to previous related work.
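A hedged sketch of the kind of update the summary describes: zero out the small gradient coordinates and inject isotropic Gaussian noise, which supplies the perturbation used to escape saddle points. The truncation rule, threshold, and noise scale here are illustrative assumptions and may not match the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def nt_sgd_step(theta, grad, eta=0.05, tau=1e-3, sigma=1e-2):
    # Truncate: keep only gradient components with magnitude >= tau,
    # motivated by the observation that most components are quite small.
    g = grad(theta)
    g = np.where(np.abs(g) >= tau, g, 0.0)
    # Inject Gaussian noise so the iterate can still escape saddle points.
    return theta - eta * g + sigma * rng.standard_normal(theta.shape)
```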
arXiv Detail & Related papers (2021-02-26T22:39:41Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks.
We show that the covariance of the SGD noise in the neighborhood of a local minimum is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD.
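Concretely, a hedged continuous-time rendering of that claim: near a local minimum $\theta^{*}$, SGD is approximated by a diffusion whose noise covariance grows quadratically with the distance from $\theta^{*}$ (the paper's power-law dynamic is a specific instance; the exact form below is illustrative):

```latex
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\Sigma(\theta_t)}\, dW_t,
\qquad
\Sigma(\theta) = \sigma_0^{2} + c\,\lVert \theta - \theta^{*} \rVert^{2}
```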
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)