The Heavy-Tail Phenomenon in SGD
- URL: http://arxiv.org/abs/2006.04740v5
- Date: Mon, 14 Jun 2021 15:45:36 GMT
- Title: The Heavy-Tail Phenomenon in SGD
- Authors: Mert Gurbuzbalaban, Umut \c{S}im\c{s}ekli, Lingjiong Zhu
- Abstract summary: We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a emphheavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
- Score: 7.366405857677226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, various notions of capacity and complexity have been
proposed for characterizing the generalization properties of stochastic
gradient descent (SGD) in deep learning. Some of the popular notions that
correlate well with the performance on unseen data are (i) the `flatness' of
the local minimum found by SGD, which is related to the eigenvalues of the
Hessian, (ii) the ratio of the stepsize $\eta$ to the batch-size $b$, which
essentially controls the magnitude of the stochastic gradient noise, and (iii)
the `tail-index', which measures the heaviness of the tails of the network
weights at convergence. In this paper, we argue that these three seemingly
unrelated perspectives for generalization are deeply linked to each other. We
claim that depending on the structure of the Hessian of the loss at the
minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD
iterates will converge to a \emph{heavy-tailed} stationary distribution. We
rigorously prove this claim in the setting of quadratic optimization: we show
that even in a simple linear regression problem with independent and
identically distributed data whose distribution has finite moments of all
order, the iterates can be heavy-tailed with infinite variance. We further
characterize the behavior of the tails with respect to algorithm parameters,
the dimension, and the curvature. We then translate our results into insights
about the behavior of SGD in deep learning. We support our theory with
experiments conducted on synthetic data, fully connected, and convolutional
neural networks.
Related papers
- From Gradient Clipping to Normalization for Heavy Tailed SGD [19.369399536643773]
Recent empirical evidence indicates that machine learning applications involve heavy-tailed noise, which challenges the standard assumptions of bounded variance in practice.
In this paper, we show that it is possible to achieve tightness of the gradient-dependent noise convergence problem under tailed noise.
arXiv Detail & Related papers (2024-10-17T17:59:01Z) - Curvature-Independent Last-Iterate Convergence for Games on Riemannian
Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate.
To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
arXiv Detail & Related papers (2023-06-29T01:20:44Z) - Understanding the Generalization Ability of Deep Learning Algorithms: A
Kernelized Renyi's Entropy Perspective [11.255943520955764]
We propose a novel information theoretical measure: kernelized Renyi's entropy.
We establish the generalization error bounds for gradient/Langevin descent (SGD/SGLD) learning algorithms under kernelized Renyi's entropy.
We show that our information-theoretical bounds depend on the statistics of the gradients, and are rigorously tighter than the current state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-05-02T01:17:15Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and gradient descent (SGD) to train SNNs, for both of which we develop consistent excess bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Heavy-Tail Phenomenon in Decentralized SGD [33.63000461985398]
We study the emergence of heavy-tails in decentralized gradient descent (DE-SGD)
We also investigate the effect of decentralization on the tail behavior.
Our theory uncovers an interesting interplay between the tails and the network structure.
arXiv Detail & Related papers (2022-05-13T14:47:04Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize
Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of gradient descent (SGD) when both number of samples and dimensions are large.
Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
empirical optimization is central to modern machine learning, but its role in its success is still unclear.
We show that it commonly arises in parameters of discrete multiplicative noise due to variance.
A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.