Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on
Least Squares
- URL: http://arxiv.org/abs/2206.01274v1
- Date: Thu, 2 Jun 2022 19:59:48 GMT
- Title: Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on
Least Squares
- Authors: Anant Raj, Melih Barsbey, Mert Gürbüzbalaban, Lingjiong Zhu and Umut Şimşekli
- Abstract summary: Recent studies have shown that heavy tails can emerge in optimization and that the heaviness of the tails has links to the generalization error.
We establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD) through the lens of algorithmic stability.
We support our theory with synthetic and real neural network experiments.
- Score: 12.2950446921662
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent studies have shown that heavy tails can emerge in stochastic
optimization and that the heaviness of the tails has links to the
generalization error. While these studies have shed light on interesting
aspects of the generalization behavior in modern settings, they relied on
strong topological and statistical regularity assumptions, which are hard to
verify in practice. Furthermore, it has been empirically illustrated that the
relation between heavy tails and generalization might not always be monotonic
in practice, contrary to the conclusions of existing theory. In this study, we
establish novel links between the tail behavior and generalization properties
of stochastic gradient descent (SGD), through the lens of algorithmic
stability. We consider a quadratic optimization problem and use a heavy-tailed
stochastic differential equation as a proxy for modeling the heavy-tailed
behavior emerging in SGD. We then prove uniform stability bounds, which reveal
the following outcomes: (i) Without making any exotic assumptions, we show that
SGD will not be stable if the stability is measured with the squared-loss
$x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead
measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending
on the variance of the data, there exists a \emph{`threshold of
heavy-tailedness'} such that the generalization error decreases as the tails
become heavier, as long as the tails are lighter than this threshold. This
suggests that the relation between heavy tails and generalization is not
globally monotonic. (iii) We prove matching lower-bounds on uniform stability,
implying that our bounds are tight in terms of the heaviness of the tails. We
support our theory with synthetic and real neural network experiments.
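As a rough illustration of this setting (not the authors' experimental protocol), the sketch below runs SGD on a small least-squares problem, adds an alpha-stable perturbation as a stand-in for the heavy-tailed SDE proxy, and compares how far two runs trained on neighbouring datasets drift apart when the gap is measured with the squared loss $x\mapsto x^2$ versus the surrogate $x\mapsto |x|^p$ with $p<2$. The tail index, surrogate exponent, noise scale, step size, and dataset sizes below are arbitrary illustrative choices.

```python
# Illustrative sketch only (not the paper's exact experimental protocol):
# run SGD on a small least-squares problem, add an alpha-stable perturbation
# as a stand-in for the heavy-tailed SDE proxy, and compare the stability gap
# between two runs trained on neighbouring datasets under the squared loss
# versus the surrogate |x|^p with p < 2.  All constants are arbitrary choices.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n, d, eta, T = 200, 5, 0.05, 2000
alpha, p = 1.7, 1.5                      # tail index < 2, surrogate exponent < 2

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=0.5, size=n)

# Neighbouring dataset: identical except the first example is resampled.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = X2[0] @ w_true + rng.normal(scale=0.5)

def run_sgd(X, y, seed):
    """Single-sample SGD on 0.5*(x_i^T w - y_i)^2 plus alpha-stable noise."""
    local = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(T):
        i = local.integers(len(y))
        grad = (X[i] @ w - y[i]) * X[i]
        noise = levy_stable.rvs(alpha, 0.0, size=d, random_state=local)
        w = w - eta * grad + (eta ** (1.0 / alpha)) * 0.01 * noise
    return w

w1 = run_sgd(X, y, seed=1)               # shared randomness across the two runs
w2 = run_sgd(X2, y2, seed=1)

# Stability gap on a fresh test point under the two loss choices.
x_test = rng.normal(size=d)
y_test = x_test @ w_true + rng.normal(scale=0.5)
r1, r2 = x_test @ w1 - y_test, x_test @ w2 - y_test
print("squared-loss gap   :", abs(r1 ** 2 - r2 ** 2))
print("|x|^p surrogate gap:", abs(abs(r1) ** p - abs(r2) ** p))
```

Because $|x|^p$ grows more slowly than $x^2$ for large deviations, extreme heavy-tailed innovations inflate the squared-loss gap more than the surrogate gap; this is the qualitative effect behind outcome (i) above.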
Related papers
- The Implicit Regularization of Dynamical Stability in Stochastic
Gradient Descent [32.25490196411385]
We study the implicit regularization of stochastic gradient descent (SGD) through the lens of dynamical stability.
We analyze the generalization properties of two-layer ReLU networks and diagonal linear networks.
arXiv Detail & Related papers (2023-05-27T14:54:21Z) - Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions [13.431453056203226]
Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical observations.
This paper develops generalization bounds for heavy-tailed SGD with general loss functions.
The results shed further light on the empirical observations, thanks to the generality of the loss functions considered.
arXiv Detail & Related papers (2023-01-27T17:57:35Z) - Clipped Stochastic Methods for Variational Inequalities with
Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for stochastic methods for solving monotone and structured non-monotone variational inequality problems (VIPs).
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA (a minimal clipping sketch follows this list).
arXiv Detail & Related papers (2022-06-02T15:21:55Z) - Fat-Tailed Variational Inference with Anisotropic Tail Adaptive Flows [53.32246823168763]
Fat-tailed densities commonly arise as posterior and marginal distributions in robust models and scale mixtures.
We first improve previous theory on tails of Lipschitz flows by quantifying how tails affect the rate of tail decay.
We then develop an alternative theory for tail parameters which is sensitive to tail-anisotropy.
arXiv Detail & Related papers (2022-05-16T18:03:41Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Convergence Rates of Stochastic Gradient Descent under Infinite Noise
Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z) - Toward Better Generalization Bounds with Locally Elastic Stability [41.7030651617752]
We argue that locally elastic stability implies tighter generalization bounds than those derived based on uniform stability.
We revisit the examples of bounded support vector machines, regularized least square regressions, and gradient descent.
arXiv Detail & Related papers (2020-10-27T02:04:53Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first-ever-known stability and generalization bounds for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z)
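As a side note on the clipping entry above ("Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise"), the following is a generic sketch of norm clipping inside stochastic gradient descent-ascent, not that paper's exact clipped-SEG/SGDA algorithms or step-size schedules; the quadratic saddle problem, the Student-t noise standing in for heavy-tailed gradients, and all constants are illustrative assumptions.

```python
# Generic sketch of norm clipping inside stochastic gradient descent-ascent
# (SGDA); not the exact clipped-SEG/SGDA methods of the paper listed above.
import numpy as np

rng = np.random.default_rng(0)
d, eta, lam, T = 4, 0.05, 1.0, 5000
A = rng.normal(size=(d, d)) / np.sqrt(d)

def clip(g, lam):
    """Rescale g so that ||g|| <= lam (leaves short vectors unchanged)."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

# Saddle problem f(x, y) = 0.5||x||^2 + x^T A y - 0.5||y||^2, saddle at (0, 0).
x, y = rng.normal(size=d), rng.normal(size=d)
for _ in range(T):
    # Noisy gradients; Student-t with 1.5 degrees of freedom has infinite variance.
    gx = x + A @ y + 0.1 * rng.standard_t(1.5, size=d)
    gy = A.T @ x - y + 0.1 * rng.standard_t(1.5, size=d)
    x = x - eta * clip(gx, lam)   # descent step in x
    y = y + eta * clip(gy, lam)   # ascent step in y

print("distance to saddle:", np.linalg.norm(np.concatenate([x, y])))
```

Clipping caps the norm of every update, so a single extreme noise realization cannot move the iterates arbitrarily far; this is the intuition behind its benefit when the gradient noise has unbounded variance.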