Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient
Descent
- URL: http://arxiv.org/abs/2310.18455v1
- Date: Fri, 27 Oct 2023 20:06:03 GMT
- Title: Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient
Descent
- Authors: Krunoslav Lehman Pavasovic, Alain Durmus, Umut Simsekli
- Abstract summary: We show that the stationary distribution of offline (also called multi-pass) SGD exhibits 'approximate' power-law tails.
Our main takeaway is that, as the number of data points increases, offline SGD will behave increasingly 'power-law-like'
- Score: 33.9917975060585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recent line of empirical studies has demonstrated that SGD might exhibit a
heavy-tailed behavior in practical settings, and the heaviness of the tails
might correlate with the overall performance. In this paper, we investigate the
emergence of such heavy tails. To our knowledge, previous works on this problem
considered only online (also called single-pass) SGD, for which the theoretical
emergence of heavy tails is contingent upon access to
an infinite amount of data. Hence, the underlying mechanism generating the
reported heavy-tailed behavior in practical settings, where the amount of
training data is finite, is still not well-understood. Our contribution aims to
fill this gap. In particular, we show that the stationary distribution of
offline (also called multi-pass) SGD exhibits 'approximate' power-law tails and
the approximation error is controlled by how fast the empirical distribution of
the training data converges to the true underlying data distribution in the
Wasserstein metric. Our main takeaway is that, as the number of data points
increases, offline SGD will behave increasingly 'power-law-like'. To achieve
this result, we first prove nonasymptotic bounds on the Wasserstein convergence of
offline SGD to online SGD as the number of data points increases, which may be of
independent interest. Finally, we illustrate our theory with experiments on
synthetic data and neural networks.
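As a hedged, minimal illustration of this setting (not the authors' code), the sketch below runs online SGD (a fresh sample at every step) and offline SGD (multiple passes over a fixed dataset of n points) on a one-dimensional least-squares problem and compares Hill estimates of the tail index of the late iterates. The step size, burn-in, dataset sizes and Hill cutoff are illustrative assumptions; the step size is chosen large enough that this toy recursion sits in a heavy-tailed regime.

    import numpy as np

    rng = np.random.default_rng(0)
    eta, n_iters, burn_in = 0.8, 200_000, 20_000   # illustrative choices, not the paper's
    w_star = 1.0                                   # true parameter of the toy model

    def make_data(n):
        """Synthetic 1-d regression data: y = a * w_star + noise."""
        a = rng.standard_normal(n)
        y = a * w_star + 0.5 * rng.standard_normal(n)
        return a, y

    def run_sgd(a, y, multi_pass):
        """SGD on 0.5*(a*w - y)^2; offline resamples indices from the fixed
        dataset (multi-pass), online consumes each fresh sample exactly once."""
        w, dist = 0.0, []
        for k in range(n_iters):
            i = rng.integers(len(a)) if multi_pass else k
            w -= eta * a[i] * (a[i] * w - y[i])
            if k >= burn_in:
                dist.append(abs(w - w_star))
        return np.array(dist)

    def hill_index(x, k=1000):
        """Simple Hill estimator of the power-law tail index from the k largest values."""
        x = np.sort(x)[-k:]
        return 1.0 / np.mean(np.log(x[1:] / x[0]))

    a_on, y_on = make_data(n_iters)                # online SGD: fresh sample per step
    print(f"online  SGD          tail index ~ {hill_index(run_sgd(a_on, y_on, False)):.2f}")
    for n in (100, 1_000, 10_000):                 # offline SGD: multi-pass over n points
        a_off, y_off = make_data(n)
        print(f"offline SGD n={n:>6} tail index ~ {hill_index(run_sgd(a_off, y_off, True)):.2f}")

In this toy, the offline estimates should approach the online one as n grows, mirroring the "increasingly power-law-like" behavior described in the abstract.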
Related papers
- Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation [70.36344590967519]
We show that noisy data and nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon.
We demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
arXiv Detail & Related papers (2024-06-27T09:57:31Z)
- Statistical Inference for Linear Functionals of Online SGD in High-dimensional Linear Regression [14.521929085104441]
We establish a high-dimensional Central Limit Theorem (CLT) for linear functionals of online stochastic gradient descent (SGD).
We develop an online approach for estimating the expectation and the variance terms appearing in the CLT, and establish high-probability bounds for the developed online estimator.
We propose a two-step fully online bias-correction methodology which together with the CLT result and the variance estimation result, provides a fully online and data-driven way to numerically construct confidence intervals.
arXiv Detail & Related papers (2023-02-20T02:38:36Z)
- Heavy-Tail Phenomenon in Decentralized SGD [33.63000461985398]
We study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD).
We also investigate the effect of decentralization on the tail behavior.
Our theory uncovers an interesting interplay between the tails and the network structure.
arXiv Detail & Related papers (2022-05-13T14:47:04Z)
- An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU Gate [0.7614628596146599]
We conjecture that the two algorithms have similar heavy-tail behaviour on any data where the latter can be proven to converge.
We demonstrate that the heavy-tail index of the late time iterates in this model scenario has strikingly different properties than either what has been proven for linear hypothesis classes or what has been previously demonstrated for large nets.
arXiv Detail & Related papers (2022-04-26T19:28:51Z)
- BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery [97.79015388276483]
A structural equation model (SEM) is an effective framework to reason over causal relationships represented via a directed acyclic graph (DAG)
Recent advances enabled effective maximum-likelihood point estimation of DAGs from observational data.
We propose BCD Nets, a variational framework for estimating a distribution over DAGs characterizing a linear-Gaussian SEM.
arXiv Detail & Related papers (2021-12-06T03:35:21Z)
- Deconfounded Score Method: Scoring DAGs with Dense Unobserved Confounding [101.35070661471124]
We show that unobserved confounding leaves a characteristic footprint in the observed data distribution that allows for disentangling spurious and causal effects.
We propose an adjusted score-based causal discovery algorithm that may be implemented with general-purpose solvers and scales to high-dimensional problems.
arXiv Detail & Related papers (2021-03-28T11:07:59Z)
- Understanding Gradient Clipping in Private SGD: A Geometric Perspective [68.61254575987013]
Deep learning models are increasingly popular in many machine learning applications where the training data may contain sensitive information.
Many learning systems now incorporate differential privacy by training their models with (differentially) private SGD.
A key step in each private SGD update is gradient clipping that shrinks the gradient of an individual example whenever its L2 norm exceeds some threshold.
arXiv Detail & Related papers (2020-06-27T19:08:12Z)
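A minimal NumPy sketch of the per-example clipping step described in the entry above (the DP-SGD-style update, threshold, learning rate and noise scale are generic illustrative assumptions, not that paper's construction):

    import numpy as np

    def clip_per_example(grads, threshold):
        """Shrink each row (one example's gradient) so its L2 norm is at most `threshold`."""
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        return grads * np.minimum(1.0, threshold / np.maximum(norms, 1e-12))

    def private_sgd_step(w, grads, lr, threshold, noise_mult, rng):
        """One DP-SGD-style step: clip per-example gradients, average them, and add
        Gaussian noise whose scale is calibrated to the clipping threshold."""
        avg = clip_per_example(grads, threshold).mean(axis=0)
        noise = rng.normal(0.0, noise_mult * threshold / grads.shape[0], size=w.shape)
        return w - lr * (avg + noise)

    # Toy usage: 32 per-example gradients for a 10-dimensional model.
    rng = np.random.default_rng(0)
    w = np.zeros(10)
    grads = rng.standard_normal((32, 10))
    w = private_sgd_step(w, grads, lr=0.1, threshold=1.0, noise_mult=1.0, rng=rng)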
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
- The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z)
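As a hedged toy illustration of how such heavy tails arise (a standard Kesten/Goldie-type argument for one-dimensional least squares, not that paper's analysis): the SGD error obeys e_{k+1} = (1 - eta * a_k^2) e_k - eta * a_k * eps_k, and its stationary tail index alpha solves E|1 - eta * a^2|^alpha = 1. The sketch below solves this equation by Monte Carlo and bisection for a few illustrative step sizes.

    import numpy as np

    rng = np.random.default_rng(1)
    u = rng.standard_normal(1_000_000) ** 2        # a_k^2 with a_k ~ N(0, 1)

    def moment_gap(alpha, eta):
        """Monte-Carlo estimate of E|1 - eta * a^2|^alpha - 1; the tail index is its root."""
        return np.mean(np.abs(1.0 - eta * u) ** alpha) - 1.0

    def tail_index(eta, lo=1e-3, hi=50.0):
        """Bisection for the positive root of E|1 - eta * a^2|^alpha = 1 (Kesten/Goldie)."""
        if moment_gap(hi, eta) < 0:                # no root below `hi`: very light tails
            return np.inf
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if moment_gap(mid, eta) < 0 else (lo, mid)
        return 0.5 * (lo + hi)

    for eta in (0.3, 0.5, 0.7, 0.9):               # larger steps -> heavier tails
        print(f"step size {eta:.1f}: predicted tail index ~ {tail_index(eta):.2f}")

In this toy model, larger step sizes (relative to the curvature and batch size) give a smaller tail index, i.e. heavier tails, matching the qualitative message of the entry above.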