On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
- URL: http://arxiv.org/abs/2102.05375v1
- Date: Wed, 10 Feb 2021 10:38:55 GMT
- Title: On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
- Authors: Liu Ziyin, Kangqiao Liu, Takashi Mori, Masahito Ueda
- Abstract summary: Noise in stochastic gradient descent (SGD) caused by minibatch sampling remains poorly understood.
Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions that cause minibatch noise to emerge.
- Score: 2.6763498831034043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The noise in stochastic gradient descent (SGD), caused by minibatch sampling,
remains poorly understood despite its enormous practical importance in offering
good training efficiency and generalization ability. In this work, we study the
minibatch noise in SGD. Motivated by the observation that minibatch sampling
does not always cause a fluctuation, we set out to find the conditions that
cause minibatch noise to emerge. We first derive the analytically solvable
results for linear regression under various settings, which are compared to the
commonly used approximations that are used to understand SGD noise. We show
that some degree of mismatch between model and data complexity is needed in
order for SGD to "cause" a noise, and that such mismatch may be due to the
existence of static noise in the labels, in the input, the use of
regularization, or underparametrization. Our results motivate a more accurate
general formulation to describe minibatch noise.
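To make the central claim concrete, here is a minimal NumPy sketch (our construction, not the authors' code) of the linear-regression setting the abstract describes: with noiseless labels, a well-specified model interpolates the data, so every minibatch gradient vanishes at the minimum and SGD is noiseless; label noise introduces exactly the kind of model-data mismatch that makes minibatch gradients fluctuate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 100, 10, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)

def minibatch_grad_std(y, w, trials=1000):
    """Std. across minibatches of the gradient of 0.5 * mean((X w - y)^2) at w."""
    grads = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch, replace=False)
        grads.append(X[idx].T @ (X[idx] @ w - y[idx]) / batch)
    return np.std(grads, axis=0).mean()

# Clean labels: the full-data minimizer interpolates every sample, so every
# minibatch gradient is exactly zero and SGD carries no noise at the minimum.
y_clean = X @ w_true
w_star = np.linalg.lstsq(X, y_clean, rcond=None)[0]
print("clean labels:", minibatch_grad_std(y_clean, w_star))   # ~0 (float eps)

# Label noise: residuals no longer vanish at the full-data minimizer, so the
# minibatch gradient fluctuates even there -- minibatch noise has emerged.
y_noisy = y_clean + 0.5 * rng.standard_normal(n)
w_star_noisy = np.linalg.lstsq(X, y_noisy, rcond=None)[0]
print("noisy labels:", minibatch_grad_std(y_noisy, w_star_noisy))  # > 0
```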
Related papers
- Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noise under mini-batch sampling in gradient descent.
We find that this implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not model SGD as an Ornstein-Uhlenbeck-like process, and it achieves a more general result with a proven convergence of the approximation.
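As a rough illustration of what "stabilize model outputs against perturbations of the parameters" means operationally, the sketch below (the toy model and the instability measure are our choices, not the paper's) estimates how much a model's outputs move under Gaussian parameter noise; a perturbation-stable convergence point scores lower.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_instability(f, w, X, sigma=0.01, trials=500):
    """Average variance of f(w + eps, x) over Gaussian parameter noise eps."""
    outs = np.stack([f(w + sigma * rng.standard_normal(w.shape), X)
                     for _ in range(trials)])
    return outs.var(axis=0).mean()

# Toy nonlinear model f(w, x) = (x . w)^2: its sensitivity to parameter noise
# grows with |x . w|, so the small-norm solution is the more stable one here.
f = lambda w, X: (X @ w) ** 2
X = rng.standard_normal((50, 5))
w_stable = 0.1 * rng.standard_normal(5)
w_unstable = 10.0 * rng.standard_normal(5)
print(output_instability(f, w_stable, X) < output_instability(f, w_unstable, X))  # True
```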
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
- Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, avoiding the arbitrary tuning from a mini-batch of samples used in previous work.
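A schematic sketch (our notation, not the paper's implementation) of the label-sampling step such a Gibbs sampler performs, given classifier probabilities and a noise transition matrix T with T[k, j] = p(observed label j | true label k):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
T = np.array([[0.8, 0.1, 0.1],   # noise transition matrix, rows sum to 1:
              [0.1, 0.8, 0.1],   # T[k, j] = p(observed = j | true = k)
              [0.1, 0.1, 0.8]])

def sample_true_labels(probs, y_noisy):
    """probs: (n, K) classifier softmax outputs; y_noisy: (n,) observed labels.
    Posterior over the true label: p(k | x, y~) proportional to p(k | x) * T[k, y~]."""
    post = probs * T[:, y_noisy].T
    post /= post.sum(axis=1, keepdims=True)
    return np.array([rng.choice(K, p=p) for p in post])

probs = rng.dirichlet(np.ones(K), size=5)   # stand-in classifier outputs
y_noisy = rng.integers(0, K, size=5)
print(sample_true_labels(probs, y_noisy))
```

A full sampler would alternate this step with re-estimating T from counts of (sampled true label, observed label) pairs.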
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
- Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for methods that solve monotone and structured non-monotone variational inequality problems (VIPs).
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
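The clipping mechanics themselves are simple to state; below is a minimal sketch (the hyperparameters and the toy example are ours) of the norm-clipping operator and one clipped-SGDA step for a min-max problem.

```python
import numpy as np

def clip(g, lam):
    """Rescale g so its norm is at most lam; small gradients pass unchanged."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

def clipped_sgda_step(x, y, grad_x, grad_y, lr=0.01, lam=1.0):
    """One descent-ascent step with clipped (possibly heavy-tailed) gradients."""
    return x - lr * clip(grad_x, lam), y + lr * clip(grad_y, lam)

# A heavy-tailed (standard Cauchy) gradient sample gets tamed by clipping.
rng = np.random.default_rng(0)
g = rng.standard_cauchy(5)
print(np.linalg.norm(g), np.linalg.norm(clip(g, 1.0)))  # second value <= 1
```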
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
- The effective noise of Stochastic Gradient Descent [9.645196221785694]
Stochastic gradient descent (SGD) is the workhorse algorithm of deep learning.
We characterize the effective noise of SGD and of a recently introduced variant, persistent SGD, in a neural network model.
We find that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
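For readers unfamiliar with persistent SGD, here is a heavily hedged sketch of its minibatch dynamics as we read them: rather than drawing a fresh batch every step, each example's batch membership is only refreshed occasionally, so consecutive batches overlap strongly. The batch fraction b and persistence time tau below are our illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, tau = 1000, 0.1, 50     # dataset size, batch fraction, persistence time

def persistent_batch_step(mask):
    """Refresh each example's membership with probability 1/tau per step."""
    refresh = rng.random(n) < 1.0 / tau
    fresh = rng.random(n) < b
    return np.where(refresh, fresh, mask)

mask0 = rng.random(n) < b
mask = mask0.copy()
for _ in range(200):
    mask = persistent_batch_step(mask)
# Overlap with the initial batch decays from ~1 toward the i.i.d. value b.
print((mask & mask0).sum() / mask0.sum())
```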
arXiv Detail & Related papers (2021-12-20T20:46:19Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
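A minimal sketch (ours, not the paper's code) of the label-noise SGD update the result concerns: at every step, fresh Gaussian noise is added to the minibatch labels before the gradient is taken.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)
theta = np.zeros(d)
lr, batch, noise_std = 0.01, 20, 0.1

for step in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    y_pert = y[idx] + noise_std * rng.standard_normal(batch)  # fresh label noise
    grad = X[idx].T @ (X[idx] @ theta - y_pert) / batch
    theta -= lr * grad

# Per the paper, such iterates track stationary points of the regularized loss
# L(theta) + lambda * R(theta) rather than settling at a minimizer of L alone.
print(np.mean((X @ theta - y) ** 2))
```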
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- Noisy Truncated SGD: Optimization and Generalization [27.33458360279836]
Recent empirical work on SGD has shown that most gradient components over epochs are quite small.
Inspired by such a study, we rigorously study the properties of noisy truncated SGD (NT-SGD).
We prove that NT-SGD can provably escape from saddle points and requires less noise compared to previous related work.
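As a purely hypothetical reading of the name, an NT-SGD step might look like the truncate-then-perturb update below; the thresholding rule and noise scale are our assumptions, not the paper's exact scheme.

```python
import numpy as np

def nt_sgd_step(theta, grad, rng, lr=0.1, threshold=1e-3, noise_std=1e-2):
    """Zero out tiny gradient components, then add isotropic Gaussian noise
    (injected noise is what typically enables provable saddle-point escape)."""
    truncated = np.where(np.abs(grad) >= threshold, grad, 0.0)
    noise = noise_std * rng.standard_normal(theta.shape)
    return theta - lr * (truncated + noise)
```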
arXiv Detail & Related papers (2021-02-26T22:39:41Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are the mainstream methods used to train deep neural networks.
We show that, near local minima, the covariance of the SGD noise is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
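A small numerical check (our construction, not the paper's) of the kind of state dependence this describes: for minibatch linear least squares, the trace of the gradient-noise covariance grows as the iterate moves away from the minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 500, 4, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

def noise_trace(w, trials=2000):
    """Trace of the covariance of the minibatch gradient evaluated at w."""
    grads = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch, replace=False)
        grads.append(X[idx].T @ (X[idx] @ w - y[idx]) / batch)
    return np.trace(np.cov(np.stack(grads).T))

direction = rng.standard_normal(d)
for r in (0.0, 0.5, 1.0, 2.0):
    print(r, noise_trace(w_star + r * direction))  # grows with the distance r
```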
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
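To see the contrast this entry draws, compare the two update rules below (a sketch under our own toy setup, not the paper's experiments): injected spherical Gaussian noise has a fixed shape wherever theta is, while minibatch noise enters through the data, depends on the residuals and hence on theta, and switches off entirely at an interpolating minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)          # noiseless, interpolable labels

def gaussian_noise_step(theta, lr=0.01, sigma=0.1):
    """GD plus spherical Gaussian noise: the noise never switches off."""
    grad = X.T @ (X @ theta - y) / n
    return theta - lr * (grad + sigma * rng.standard_normal(d))

def minibatch_step(theta, lr=0.01, batch=10):
    """SGD: the noise covariance scales with the residuals, so it is
    parameter-dependent and exactly zero at an interpolating minimum."""
    idx = rng.choice(n, size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch
    return theta - lr * grad

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(minibatch_step(theta_star), theta_star))       # True: no kick
print(np.allclose(gaussian_noise_step(theta_star), theta_star))  # False: still kicked
```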