Shape Matters: Understanding the Implicit Bias of the Noise Covariance
- URL: http://arxiv.org/abs/2006.08680v2
- Date: Thu, 18 Jun 2020 03:34:08 GMT
- Title: Shape Matters: Understanding the Implicit Bias of the Noise Covariance
- Authors: Jeff Z. HaoChen, Colin Wei, Jason D. Lee, Tengyu Ma
- Abstract summary: Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
- Score: 76.54300276636982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The noise in stochastic gradient descent (SGD) provides a crucial implicit
regularization effect for training overparameterized models. Prior theoretical
work largely focuses on spherical Gaussian noise, whereas empirical studies
demonstrate the phenomenon that parameter-dependent noise -- induced by
mini-batches or label perturbation -- is far more effective than Gaussian
noise. This paper theoretically characterizes this phenomenon on a
quadratically-parameterized model introduced by Vaskevicius et al. and
Woodworth et al. We show that in an over-parameterized setting, SGD with label
noise recovers the sparse ground-truth with an arbitrary initialization,
whereas SGD with Gaussian noise or gradient descent overfits to dense solutions
with large norms. Our analysis reveals that parameter-dependent noise
introduces a bias towards local minima with smaller noise variance, whereas
spherical Gaussian noise does not. Code for our project is publicly available.
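As a quick way to see the contrast the abstract describes, the sketch below (not the authors' released code; the model size, step size, noise scale, and initialization are all illustrative assumptions) trains the quadratically parameterized sparse-regression model w = u * u with label-noise perturbation, with spherical Gaussian parameter noise, and with no noise, then reports recovery error and solution norm.

```python
# Minimal sketch of the setting described in the abstract (not the authors'
# released code): sparse regression with the quadratic parameterization
# w = u * u, trained with (a) label-noise perturbation, (b) spherical
# Gaussian parameter noise, or (c) plain gradient descent. The model size,
# step size, noise scale, and initialization are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 100, 3                       # samples, ambient dimension, sparsity
w_star = np.zeros(d)
w_star[:k] = 1.0                           # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ w_star                             # noiseless labels, n < d (overparameterized)

def run(noise, steps=200_000, lr=1e-3, sigma=1.0):
    u = np.full(d, 0.5)                    # dense (non-sparse) initialization
    for _ in range(steps):
        target = y
        if noise == "label":               # fresh +/- sigma label perturbation each step
            target = y + sigma * rng.choice([-1.0, 1.0], size=n)
        resid = X @ (u * u) - target       # residuals, shape (n,)
        u -= lr * 4.0 * (X * resid[:, None]).mean(axis=0) * u   # chain rule through w = u*u
        if noise == "gaussian":            # spherical Gaussian parameter noise
            u += lr * sigma * rng.standard_normal(d)
    return u * u

for noise in ("label", "gaussian", "none"):
    w = run(noise)
    print(f"{noise:>8s}: ||w - w*|| = {np.linalg.norm(w - w_star):.3f}, "
          f"||w|| = {np.linalg.norm(w):.3f}")
```

Per the abstract, the label-noise run should drift toward the sparse ground truth while the Gaussian-noise and noiseless runs stay on denser, larger-norm interpolating solutions; how sharply the gap shows up depends on the assumed hyperparameters and the number of steps.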
Related papers
- Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noise under mini-batch sampling in gradient descent.
We find that this implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not assume that SGD is an Ornstein-Uhlenbeck-like process and achieves a more general result, with the convergence of the approximation proven.
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
- Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, avoiding the arbitrary tuning from a mini-batch of samples used in previous work.
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
- The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from the common assumption that the noise distribution should match the data can actually lead to better statistical estimators.
In particular, the optimal noise distribution differs from the data distribution and may even come from a different family; a toy NCE sketch follows this entry.
arXiv Detail & Related papers (2022-03-02T13:59:20Z)
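As a hands-on companion to that claim, here is a small toy sketch of noise-contrastive estimation (NCE) on a 1-D Gaussian model. It is not that paper's experiment; the distributions, sample sizes, and optimizer are assumptions. It repeatedly estimates the data mean via NCE under two different noise choices and compares the spread of the estimates.

```python
# Toy noise-contrastive estimation (NCE): estimate the mean of N(mu_true, 1)
# by logistic discrimination of data against noise samples, then compare the
# estimator's spread under two noise distributions. Illustrative assumptions
# throughout; not the paper's experimental setup.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import laplace, norm

rng = np.random.default_rng(1)
mu_true, n_samples, n_trials = 1.0, 200, 200

def nce_estimate(data, noise, noise_logpdf):
    # NCE objective for the normalized model N(mu, 1) with log-ratio
    # G(u) = log p_mu(u) - log q(u); maximize
    # sum log sigmoid(G(data)) + sum log(1 - sigmoid(G(noise))).
    def neg_objective(mu):
        g_data = norm.logpdf(data, mu, 1.0) - noise_logpdf(data)
        g_noise = norm.logpdf(noise, mu, 1.0) - noise_logpdf(noise)
        obj = -np.logaddexp(0.0, -g_data).sum() - np.logaddexp(0.0, g_noise).sum()
        return -obj
    return minimize_scalar(neg_objective, bounds=(-5.0, 5.0), method="bounded").x

noise_choices = {
    "noise = data distribution": (
        lambda: norm.rvs(mu_true, 1.0, size=n_samples, random_state=rng),
        lambda u: norm.logpdf(u, mu_true, 1.0)),
    "noise = wide Laplace": (
        lambda: laplace.rvs(0.0, 2.0, size=n_samples, random_state=rng),
        lambda u: laplace.logpdf(u, 0.0, 2.0)),
}

for name, (sample_noise, noise_logpdf) in noise_choices.items():
    estimates = [nce_estimate(norm.rvs(mu_true, 1.0, size=n_samples, random_state=rng),
                              sample_noise(), noise_logpdf)
                 for _ in range(n_trials)]
    print(f"{name}: std of NCE estimate = {np.std(estimates):.4f}")
```

Varying the noise family and scale in this sandbox is one way to probe the claim that the data distribution itself is not automatically the best noise choice; which choice wins in this toy depends on the settings.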
- Multiview point cloud registration with anisotropic and space-varying localization noise [1.5499426028105903]
We address the problem of registering multiple point clouds corrupted by highly anisotropic localization noise.
Existing methods are based on an implicit assumption of space-invariant isotropic noise.
We show that our noise-handling strategy significantly improves robustness to high levels of anisotropic noise.
arXiv Detail & Related papers (2022-01-03T15:21:24Z)
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance (see the sketch after this entry).
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
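To make that statement concrete, the sketch below (a hypothetical toy linear-regression setup, not that paper's code) estimates the gradient covariance from per-sample gradients and injects SGLD-style noise whose covariance is the matrix square root of that estimate; the model, step size, temperature, and the centered covariance estimator are all assumptions.

```python
# SGLD-style updates with anisotropic noise whose covariance is the square
# root of an estimate of the gradient covariance, per the result summarized
# above. Toy linear regression; every numeric choice here is illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

theta = np.zeros(d)
lr, temperature = 1e-2, 1e-3

def per_sample_grads(theta):
    # Gradient of 0.5 * (x.theta - y)^2 for each sample, shape (n, d).
    resid = X @ theta - y
    return resid[:, None] * X

for _ in range(500):
    G = per_sample_grads(theta)
    mean_grad = G.mean(axis=0)
    C = np.cov(G, rowvar=False)            # empirical gradient covariance, (d, d)
    # Sample noise with covariance C^{1/2}: if A = C^{1/4}, then A A^T = C^{1/2}.
    eigval, eigvec = np.linalg.eigh(C)     # C is symmetric PSD
    quarter_root = np.clip(eigval, 0.0, None) ** 0.25
    noise = eigvec @ (quarter_root * rng.standard_normal(d))   # Cov(noise) = C^{1/2}
    theta = theta - lr * mean_grad + np.sqrt(2.0 * lr * temperature) * noise

print("final training loss:", 0.5 * np.mean((X @ theta - y) ** 2))
```

The sqrt(2 * lr * temperature) scaling is the usual SGLD convention; the paper's precise scaling and constraint set may differ, so treat this purely as a reading aid.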
- Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics [25.95229631113089]
We show that the gradient noise possesses finite variance, and therefore the Central Limit Theorem (CLT) applies.
We then demonstrate the existence of a steady-state distribution of SGD and approximate this distribution in the small-learning-rate regime.
arXiv Detail & Related papers (2021-09-20T20:39:14Z)
- Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry.
arXiv Detail & Related papers (2021-02-13T21:28:09Z)
- On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes [2.6763498831034043]
Noise in stochastic gradient descent (SGD) caused by minibatch sampling remains poorly understood.
Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions that cause minibatch noise to emerge.
arXiv Detail & Related papers (2021-02-10T10:38:55Z)