Improved generalization by noise enhancement
- URL: http://arxiv.org/abs/2009.13094v1
- Date: Mon, 28 Sep 2020 06:29:23 GMT
- Title: Improved generalization by noise enhancement
- Authors: Takashi Mori, Masahito Ueda
- Abstract summary: Noise in stochastic gradient descent (SGD) is closely related to generalization.
We propose a method, ``noise enhancement'', that controls the SGD noise without changing the learning rate or the minibatch size.
It turns out that large-batch training with the noise enhancement even shows better generalization compared with small-batch training.
- Score: 5.33024001730262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have demonstrated that noise in stochastic gradient descent
(SGD) is closely related to generalization: A larger SGD noise, if not too
large, results in better generalization. Since the covariance of the SGD noise
is proportional to $\eta^2/B$, where $\eta$ is the learning rate and $B$ is the
minibatch size of SGD, the SGD noise has so far been controlled by changing
$\eta$ and/or $B$. However, too large $\eta$ results in instability in the
training dynamics and a small $B$ prevents scalable parallel computation. It is
thus desirable to develop a method of controlling the SGD noise without
changing $\eta$ and $B$. In this paper, we propose a method that achieves this
goal using ``noise enhancement'', which is easily implemented in practice. We
expound the underlying theoretical idea and demonstrate that the noise
enhancement actually improves generalization for real datasets. It turns out
that large-batch training with the noise enhancement even shows better
generalization compared with small-batch training.
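The following is a minimal numerical sketch of the idea, not the authors' exact algorithm: since the SGD noise covariance scales as $\eta^2/B$, the noise can be amplified at fixed $\eta$ and $B$ by combining the gradients of two independently sampled minibatches with coefficients that sum to one, which leaves the expected gradient unchanged. The toy regression problem, the amplification coefficient alpha, and all hyperparameters below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy problem: linear regression with loss L(w) = mean_i (x_i . w - y_i)^2 / 2.
n, d = 1024, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def minibatch_grad(w, batch_size):
    # Gradient of the loss on one uniformly sampled minibatch.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

def noise_enhanced_grad(w, batch_size, alpha):
    # Combine two independent minibatch gradients with coefficients that
    # sum to one: the mean is unchanged, while the noise covariance is
    # multiplied by (1 + alpha)^2 + alpha^2.
    g1 = minibatch_grad(w, batch_size)
    g2 = minibatch_grad(w, batch_size)
    return (1.0 + alpha) * g1 - alpha * g2

# Noise-enhanced SGD at fixed learning rate eta and batch size B.
eta, B, alpha = 0.1, 256, 1.0
w = np.zeros(d)
for _ in range(200):
    w -= eta * noise_enhanced_grad(w, B, alpha)
print("final training loss:", 0.5 * np.mean((X @ w - y) ** 2))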
Related papers
- Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, avoiding the arbitrary tuning from a mini-batch of samples seen in previous work.
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
- Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning [3.0222726254970174]
Noise in stochastic gradient descent (SGD) affects the generalization of deep neural networks.
We show that SGD noise can be detrimental or instead useful depending on the training regime.
arXiv Detail & Related papers (2023-01-31T15:22:24Z)
- Identifying Hard Noise in Long-Tailed Sample Distribution [76.16113794808001]
We introduce Noisy Long-Tailed Classification (NLT).
Most de-noising methods fail to identify the hard noises.
We design an iterative noisy learning framework called Hard-to-Easy (H2E).
arXiv Detail & Related papers (2022-07-27T09:03:03Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss (a toy sketch of label-noise SGD appears after this list).
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes [2.6763498831034043]
Noise in stochastic gradient descent (SGD) caused by minibatch sampling remains poorly understood.
Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions that cause minibatch noise to emerge (see the covariance-scaling sketch after this list).
arXiv Detail & Related papers (2021-02-10T10:38:55Z)
- SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z)
- Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning [165.47118387176607]
It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed.
Specifically, we observe heavy tails in the gradient noise of these algorithms.
arXiv Detail & Related papers (2020-10-12T12:00:26Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods for training deep neural networks.
We show that the covariance of the SGD noise in the neighborhood of local minima is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD (a numerical check of this quadratic scaling appears after this list).
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
- Inherent Noise in Gradient Based Methods [3.0712335337791288]
Noise and its effect on robustness to perturbations have been linked to generalization.
We show that this noise penalizes models that are sensitive to perturbations in the weights.
We find that penalties are most pronounced for batches that are currently being used to update, and are higher for larger models.
arXiv Detail & Related papers (2020-05-26T14:12:22Z)
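To make the label-noise mechanism in the ``Label Noise SGD Provably Prefers Flat Global Minimizers'' entry concrete, here is a toy sketch: perturbing the regression labels afresh at every step leaves the expected gradient equal to that of the clean loss $L(\theta)$, while the injected parameter-dependent noise supplies the implicit regularization. The setup and the noise scale sigma are illustrative assumptions, not taken from that paper.

import numpy as np

rng = np.random.default_rng(1)

# Toy regression problem.
n, d = 512, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

eta, B, sigma = 0.05, 64, 0.5  # sigma is the assumed label-noise scale
w = np.zeros(d)
for _ in range(1000):
    idx = rng.choice(n, size=B, replace=False)
    # Fresh label noise at every step: in expectation this is plain SGD
    # on the clean loss, but the extra noise acts as a regularizer.
    yb = y[idx] + sigma * rng.normal(size=B)
    w -= eta * X[idx].T @ (X[idx] @ w - yb) / B
print("clean loss:", 0.5 * np.mean((X @ w - y) ** 2))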
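The $\eta^2/B$ scaling quoted in the abstract, and the minibatch-noise question raised in the ``On Minibatch Noise'' entry, can be checked numerically: the sketch below estimates the trace of the minibatch-gradient covariance at a fixed parameter vector and shows it shrinking roughly like $1/B$ ($\eta$ enters only through the update $\eta g$). The toy problem and sample sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

n, d = 4096, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = rng.normal(size=d)

full_grad = X.T @ (X @ w - y) / n

def noise_trace(batch_size, trials=2000):
    # Monte Carlo estimate of tr Cov[g_B] at fixed w.
    devs = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        devs.append(g - full_grad)
    return float(np.mean(np.sum(np.square(devs), axis=1)))

# The estimated trace should drop by roughly 4x per 4x increase in B.
for B in (16, 64, 256):
    print(B, noise_trace(B))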
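Finally, the quadratic state dependence noted in the ``Dynamic of Stochastic Gradient Descent with State-Dependent Noise'' entry is visible in the same toy setting: with noiseless labels, the minibatch noise vanishes exactly at the minimum, and its trace grows quadratically with the distance from it. This is a numerical illustration under assumptions, not that paper's construction.

import numpy as np

rng = np.random.default_rng(3)

n, d = 2048, 6
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star  # noiseless labels, so the SGD noise vanishes at w_star

def noise_trace(w, batch_size=32, trials=2000):
    # Monte Carlo estimate of tr Cov[g_B] at the point w.
    full = X.T @ (X @ w - y) / n
    devs = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        devs.append(g - full)
    return float(np.mean(np.sum(np.square(devs), axis=1)))

# Doubling the distance from the minimum should roughly quadruple the
# noise trace: the covariance is quadratic in the state w - w_star.
u = rng.normal(size=d)
for r in (0.5, 1.0, 2.0):
    print(r, noise_trace(w_star + r * u))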
This list is automatically generated from the titles and abstracts of the papers on this site.