Inherent Noise in Gradient Based Methods
- URL: http://arxiv.org/abs/2005.12743v1
- Date: Tue, 26 May 2020 14:12:22 GMT
- Title: Inherent Noise in Gradient Based Methods
- Authors: Arushi Gupta
- Abstract summary: Noise and its effect on robustness to perturbations has been linked to generalization.
We show that this noise penalizes models that are sensitive to perturbations in the weights.
We find that penalties are most pronounced for batches that are currently being used to update, and are higher for larger models.
- Score: 3.0712335337791288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous work has examined the ability of larger capacity neural networks to
generalize better than smaller ones, even without explicit regularizers, by
analyzing gradient-based algorithms such as GD and SGD. The presence of noise
and its effect on robustness to parameter perturbations have been linked to
generalization. We examine a property of GD and SGD, namely that instead of
iterating through all scalar weights in the network and updating them one by
one, GD (and SGD) updates all the parameters at the same time. As a result,
each parameter $w^i$ calculates its partial derivative at the stale parameter
$\mathbf{w}_t$, but then suffers loss $\hat{L}(\mathbf{w}_{t+1})$. We show that
this causes noise to be introduced into the optimization. We find that this
noise penalizes models that are sensitive to perturbations in the weights. We
find that penalties are most pronounced for batches that are currently being
used to update, and are higher for larger models.
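The following NumPy sketch (our illustration, not code from the paper; the quadratic loss, dimension, and step size are arbitrary assumptions) contrasts the simultaneous update that GD actually performs with a coordinate-wise update in which no partial derivative is stale:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w used to compare the two update
# schemes discussed in the abstract. A, d, and eta are illustrative
# choices, not taken from the paper.
rng = np.random.default_rng(0)
d = 20
Q = rng.standard_normal((d, d))
A = Q.T @ Q / d + np.eye(d)  # symmetric positive definite

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

eta = 0.05
w = rng.standard_normal(d)

# Simultaneous update (what GD does): every coordinate uses the partial
# derivative evaluated at the stale iterate w_t.
w_simul = w - eta * grad(w)

# Coordinate-wise (Gauss-Seidel style) update: each coordinate sees the
# partially updated vector, so its partial derivative is never stale.
w_coord = w.copy()
for i in range(d):
    w_coord[i] -= eta * grad(w_coord)[i]

print(f"loss after simultaneous update : {loss(w_simul):.6f}")
print(f"loss after coordinate-wise step: {loss(w_coord):.6f}")
print(f"gap attributable to staleness  : {loss(w_simul) - loss(w_coord):.6f}")
```

On this toy problem, the gap between the two one-step losses gives a rough sense of the perturbation each weight suffers because all the other weights moved at the same time.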
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size, $N$ is the data size, and $a$ is the power-law exponent of the data spectrum; a numeric reading of the bound follows this entry.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
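As a quick numeric reading of this bound (our illustration; the constants hidden by $\Theta(\cdot)$ are ignored, and the values of $a$, $M$, $N$ below are arbitrary), comparing the two terms shows whether parameters or data form the bottleneck:

```python
# Evaluate the two terms of the reducible-error bound
# Theta(M^{-(a-1)} + N^{-(a-1)/a}) for illustrative values.
# Hidden constants are ignored; a, M, and N are arbitrary choices.
a = 2.0  # assumed power-law exponent of the data spectrum
for M, N in [(10**3, 10**6), (10**4, 10**6), (10**4, 10**8)]:
    model_term = M ** -(a - 1)
    data_term = N ** -((a - 1) / a)
    bottleneck = "model" if model_term > data_term else "data"
    print(f"M={M:>6}, N={N:>9}: M-term={model_term:.2e}, "
          f"N-term={data_term:.2e}, bottleneck={bottleneck}")
```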
- On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions [4.9495085874952895]
The Adaptive Moment Estimation (Adam) algorithm is highly effective in various deep learning tasks.
We show that, with high probability, Adam can find a stationary point at a $\mathcal{O}(\mathrm{poly}(\log T)/\sqrt{T})$ rate in $T$ iterations under this general noise model.
arXiv Detail & Related papers (2024-02-06T13:19:26Z)
- Some Constructions of Private, Efficient, and Optimal $K$-Norm and Elliptic Gaussian Noise [54.34628844260993]
Differentially private computation often begins with a bound on some $d$-dimensional statistic's sensitivity.
For pure differential privacy, the $K$-norm mechanism can improve on this approach using a norm tailored to the statistic's sensitivity space.
This paper solves both problems for the simple statistics of sum, count, and vote.
arXiv Detail & Related papers (2023-09-27T17:09:36Z)
- Noise Regularizes Over-parameterized Rank One Matrix Recovery, Provably [42.427869499882206]
We parameterize the rank-one matrix $Y^*$ by $XX^\top$, where $X \in \mathbb{R}^{d \times d}$.
We then show that under mild conditions, the estimator, obtained by the randomly perturbed gradient descent algorithm using the square loss function, attains a mean square error of $O(\sigma^2/d)$.
In contrast, the estimator obtained by gradient descent without random perturbation only attains a mean square error of $O(\sigma^2)$; a toy sketch of this setup follows the entry.
arXiv Detail & Related papers (2022-02-07T21:53:51Z)
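A minimal sketch of this setup as we read it (our code, not the authors'; the dimension, noise level $\sigma$, step size, and perturbation scale are assumptions): fit a noisy rank-one target through the over-parameterization $XX^\top$ with the square loss, perturbing each gradient step with small random noise:

```python
import numpy as np

# Noisy rank-one recovery with the over-parameterization Y ~ X X^T and a
# randomly perturbed gradient step. All hyperparameters are assumptions.
rng = np.random.default_rng(0)
d, sigma, eta, steps = 30, 0.1, 0.01, 2000

y = rng.standard_normal(d)
y /= np.linalg.norm(y)
Y_star = np.outer(y, y)                           # ground-truth rank-one matrix
M = Y_star + sigma * rng.standard_normal((d, d))  # noisy observation

X = 0.1 * rng.standard_normal((d, d))             # over-parameterized factor
for _ in range(steps):
    R = X @ X.T - M                               # residual of the square loss
    G = (R + R.T) @ X                             # gradient of 0.5 * ||X X^T - M||_F^2
    X -= eta * (G + 0.01 * rng.standard_normal((d, d)))  # perturbed step

print(f"mean square error vs Y*: {np.mean((X @ X.T - Y_star) ** 2):.4e}")
```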
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry.
arXiv Detail & Related papers (2021-02-13T21:28:09Z)
- Improved generalization by noise enhancement [5.33024001730262]
Noise in stochastic gradient descent (SGD) is closely related to generalization.
We propose a method that improves generalization using "noise enhancement".
It turns out that large-batch training with the noise enhancement even shows better generalization compared with small-batch training.
arXiv Detail & Related papers (2020-09-28T06:29:23Z)
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training over-parameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to premature decay of effective step sizes and sub-optimal model performance; a toy simulation of this effect follows the entry.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
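The mechanism is easy to reproduce in a toy simulation (our construction, not the paper's code; the direction-only loss, learning rate, and momentum coefficient are assumptions): when a loss depends only on $w/\lVert w \rVert$, gradients are orthogonal to $w$, so every update grows $\lVert w \rVert$, and momentum accelerates that growth, shrinking the effective step size $\eta/\lVert w \rVert^2$:

```python
import numpy as np

# For a scale-invariant weight, the loss depends only on the direction
# w/||w||, so the gradient is orthogonal to w and every update grows
# ||w||. Momentum speeds up that growth, shrinking the effective step
# size lr/||w||^2. Loss, lr, and beta are illustrative assumptions.
rng = np.random.default_rng(0)
d, lr, beta, steps = 50, 0.1, 0.9, 200
target = rng.standard_normal(d)
target /= np.linalg.norm(target)
w0 = rng.standard_normal(d)

def grad(w):
    u = w / np.linalg.norm(w)
    g_dir = -(target - (target @ u) * u)  # projected gradient w.r.t. the direction u
    return g_dir / np.linalg.norm(w)      # chain rule; orthogonal to w

for use_momentum in (False, True):
    w = w0.copy()
    m = np.zeros(d)
    for _ in range(steps):
        g = grad(w)
        m = beta * m + g if use_momentum else g
        w = w - lr * m
    label = "momentum" if use_momentum else "vanilla "
    print(f"{label} GD: ||w|| = {np.linalg.norm(w):7.3f}, "
          f"effective step = {lr / np.linalg.norm(w)**2:.5f}")
```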