Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions
- URL: http://arxiv.org/abs/2301.11885v2
- Date: Mon, 30 Jan 2023 05:14:51 GMT
- Title: Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions
- Authors: Anant Raj and Lingjiong Zhu and Mert Gürbüzbalaban and Umut Şimşekli
- Abstract summary: Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. This paper develops generalization bounds for a general class of loss functions, including non-convex ones, via Wasserstein stability bounds for heavy-tailed SDEs and their discretizations. The results shed more light on the empirical observations, thanks to the generality of the loss functions.
- Score: 13.431453056203226
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported
in several empirical studies. Experimental evidence in previous works suggests
a strong interplay between the heaviness of the tails and generalization
behavior of SGD. To address these empirical phenomena theoretically, several
works have made strong topological and statistical assumptions to link the
generalization error to heavy tails. Very recently, new generalization bounds
have been proven, indicating a non-monotonic relationship between the
generalization error and heavy tails, which is more pertinent to the reported
empirical observations. While these bounds do not require additional
topological assumptions given that SGD can be modeled using a heavy-tailed
stochastic differential equation (SDE), they can only apply to simple quadratic
problems. In this paper, we build on this line of research and develop
generalization bounds for a more general class of objective functions, which
includes non-convex functions as well. Our approach is based on developing
Wasserstein stability bounds for heavy-tailed SDEs and their discretizations,
which we then convert to generalization bounds. Our results do not require any
nontrivial assumptions; yet, they shed more light on the empirical
observations, thanks to the generality of the loss functions.
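As a concrete illustration of the heavy-tailed SDE viewpoint described above, the following is a minimal sketch of an SGD recursion whose gradient noise is alpha-stable, i.e. an Euler-type discretization of a Lévy-driven SDE. It is not the paper's construction: the toy non-convex loss, synthetic data, step size, noise scale, and tail index are all hypothetical choices made for illustration.

```python
# Minimal sketch, assuming a toy non-convex regression loss: an SGD recursion
# with alpha-stable (heavy-tailed) gradient noise, i.e. an Euler-type
# discretization of a Levy-driven SDE. Illustrative only.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

def grad_loss(theta, x, y):
    # Gradient of the per-sample loss 0.5 * (tanh(x @ theta) - y) ** 2
    pred = np.tanh(x @ theta)
    return (pred - y) * (1.0 - pred ** 2) * x

# Hypothetical synthetic data
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.tanh(X @ rng.standard_normal(d)) + 0.1 * rng.standard_normal(n)

alpha = 1.7   # tail index in (1, 2); alpha = 2 recovers the Gaussian regime
eta = 0.01    # step size
sigma = 0.1   # noise scale
theta = np.zeros(d)

for _ in range(2000):
    i = rng.integers(n)  # single-sample stochastic gradient
    jump = levy_stable.rvs(alpha, 0.0, size=d, random_state=rng)
    # alpha-stable increments over a step of length eta scale like eta ** (1 / alpha)
    theta = theta - eta * grad_loss(theta, X[i], y[i]) + sigma * eta ** (1.0 / alpha) * jump

print("final iterate:", theta)
```

Setting alpha = 2 recovers Gaussian-like noise, while smaller alpha produces occasional large jumps; it is this heavy-tailed regime that the Wasserstein stability analysis is designed to handle.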
Related papers
- Equivariance and partial observations in Koopman operator theory for partial differential equations [1.099532646524593]
We show that symmetries in the system dynamics can be carried over to the Koopman operator.
We address the highly relevant case where we cannot measure the full state.
arXiv Detail & Related papers (2023-07-28T06:03:19Z) - Large deviations rates for stochastic gradient descent with strongly
convex functions [11.247580943940916]
We provide a formal framework for the study of general high-probability bounds for stochastic gradient descent (SGD).
We derive an upper large-deviations bound for SGD with strongly convex functions.
arXiv Detail & Related papers (2022-11-02T09:15:26Z) - Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on
Least Squares [12.2950446921662]
Recent studies have shown that heavy tails can emerge in optimization and that the heaviness of the tails has links to the generalization error.
We establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD) through the lens of algorithmic stability.
We support our theory with synthetic and real neural network experiments.
arXiv Detail & Related papers (2022-06-02T19:59:48Z) - Clipped Stochastic Methods for Variational Inequalities with
Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for methods for solving monotone and structured non-monotone VIPs.
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA (see the clipping sketch after this list).
arXiv Detail & Related papers (2022-06-02T15:21:55Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Generalization Bounds for Stochastic Gradient Langevin Dynamics: A
Unified View via Information Leakage Analysis [49.402932368689775]
We present a unified view via information leakage analysis to investigate the generalization bounds of SGLD.
We also conduct numerical studies to assess the information leakage of SGLD.
arXiv Detail & Related papers (2021-12-14T06:45:52Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Global Convergence and Stability of Stochastic Gradient Descent [0.0]
We show that SGD converges to a stationary point under nearly arbitrary non-convexity and noise models.
We show that SGD can be applied to a wide range of problems with confidence in its global convergence.
arXiv Detail & Related papers (2021-10-04T19:00:50Z) - Deconfounded Score Method: Scoring DAGs with Dense Unobserved
Confounding [101.35070661471124]
We show that unobserved confounding leaves a characteristic footprint in the observed data distribution that allows for disentangling spurious and causal effects.
We propose an adjusted score-based causal discovery algorithm that may be implemented with general-purpose solvers and scales to high-dimensional problems.
arXiv Detail & Related papers (2021-03-28T11:07:59Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z)
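For the clipped stochastic methods entry above, here is a minimal hedged sketch of norm clipping inside a stochastic extragradient (SEG) update, applied to a toy strongly monotone operator corrupted by alpha-stable noise. The operator, step size, clipping level, and tail index are hypothetical choices; this illustrates the clipping idea and is not the cited paper's algorithm.

```python
# Minimal sketch, assuming a toy strongly monotone operator F(z) = (mu*I + S) z
# with S skew-symmetric: clipped stochastic extragradient (SEG) under
# alpha-stable (heavy-tailed) oracle noise. Illustrative only.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(1)
d, mu, alpha = 4, 1.0, 1.5

S = rng.standard_normal((d, d))
S = S - S.T                       # skew-symmetric part
B = mu * np.eye(d) + S            # strongly monotone linear operator, root at z = 0

def noisy_oracle(z):
    # Operator value corrupted by heavy-tailed noise
    return B @ z + 0.1 * levy_stable.rvs(alpha, 0.0, size=d, random_state=rng)

def clip(g, lam):
    # Norm clipping: rescale g so that ||g|| <= lam
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

z, gamma, lam = rng.standard_normal(d), 0.05, 1.0
for _ in range(5000):
    z_lead = z - gamma * clip(noisy_oracle(z), lam)    # extrapolation step
    z = z - gamma * clip(noisy_oracle(z_lead), lam)    # update step

print("distance to the solution z* = 0:", np.linalg.norm(z))
```

The clip caps each step's displacement at gamma * lam, so a single extreme heavy-tailed sample cannot throw the iterates arbitrarily far from the solution.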
This list is automatically generated from the titles and abstracts of the papers on this site.