Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning
- URL: http://arxiv.org/abs/2301.13703v2
- Date: Tue, 30 May 2023 12:21:35 GMT
- Title: Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning
- Authors: Antonio Sclocchi, Mario Geiger, Matthieu Wyart
- Abstract summary: Noise in gradient descent affects generalization of deep neural networks.
We show that SGD noise can be detrimental or instead useful depending on the training regime.
- Score: 3.0222726254970174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding when the noise in stochastic gradient descent (SGD) affects
generalization of deep neural networks remains a challenge, complicated by the
fact that networks can operate in distinct training regimes. Here we study how
the magnitude of this noise $T$ affects performance as the size of the training
set $P$ and the scale of initialization $\alpha$ are varied. For gradient
descent, $\alpha$ is a key parameter that controls if the network is
`lazy' ($\alpha\gg1$) or instead learns features ($\alpha\ll1$). For
classification of MNIST and CIFAR10 images, our central results are: (i)
obtaining phase diagrams for performance in the $(\alpha,T)$ plane. They show
that SGD noise can be detrimental or instead useful depending on the training
regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the
net to escape the lazy regime, these changes can have opposite effects on
performance. (ii) Most importantly, we find that the characteristic temperature
$T_c$ where the noise of SGD starts affecting the trained model (and eventually
performance) is a power law of $P$. We relate this finding with the observation
that key dynamical quantities, such as the total variation of weights during
training, depend on both $T$ and $P$ as power laws. These results indicate that
a key effect of SGD noise occurs late in training by affecting the stopping
process whereby all data are fitted. Indeed, we argue that due to SGD noise,
nets must develop a stronger `signal', i.e. larger informative weights, to fit
the data, leading to a longer training time. A stronger signal and a longer
training time are also required when the size of the training set $P$
increases. We confirm these views in the perceptron model, where signal and
noise can be precisely measured. Interestingly, exponents characterizing the
effect of SGD depend on the density of data near the decision boundary, as we
explain.
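The sketch below (not the authors' code) illustrates the two control parameters of the study: the output scale $\alpha$, implemented with the common centering convention $f_\alpha(x)=\alpha\,(f(x;w)-f(x;w_0))$ that interpolates between the lazy ($\alpha\gg1$) and feature-learning ($\alpha\ll1$) regimes, and a crude proxy for the SGD noise magnitude $T$ taken here as learning rate over batch size. The paper's exact loss-scaling conventions and definition of $T$ may differ; the model sizes, learning rate, and loss are illustrative assumptions.
```python
# Minimal sketch (not the authors' code) of the two knobs varied in the paper:
# the initialization/output scale alpha and a proxy for the SGD noise magnitude T.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_alpha_model(alpha, in_dim=784, width=256, n_classes=10):
    net = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(), nn.Linear(width, n_classes))
    net0 = copy.deepcopy(net)              # frozen copy of the network at initialization
    for p in net0.parameters():
        p.requires_grad_(False)

    def f_alpha(x):
        # f_alpha(x) = alpha * (f(x; w) - f(x; w0)).
        # Large alpha: output dominated by the linearization around w0 (lazy regime).
        # Small alpha: weights must move far from w0 to fit the data (feature learning).
        return alpha * (net(x) - net0(x))

    return net, f_alpha

def sgd_temperature(lr, batch_size):
    # Illustrative proxy for the SGD noise scale T; the paper defines T precisely.
    return lr / batch_size

# Usage: one SGD step at a given point of the (alpha, T) plane on dummy MNIST-sized data.
alpha, lr, B = 100.0, 0.1, 32
net, f_alpha = make_alpha_model(alpha)
opt = torch.optim.SGD(net.parameters(), lr=lr)
x, y = torch.randn(B, 784), torch.randint(0, 10, (B,))
loss = F.cross_entropy(f_alpha(x), y)
opt.zero_grad(); loss.backward(); opt.step()
print(f"alpha={alpha}, T~{sgd_temperature(lr, B):.4f}, loss={loss.item():.3f}")
```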
Related papers
- The Optimization Landscape of SGD Across the Feature Learning Strength [102.1353410293931]
We study the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting.
We find that optimal online performance is often achieved at large $\gamma$.
Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
arXiv Detail & Related papers (2024-10-06T22:30:14Z)
- Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs [24.305423716384272]
We study the impact of the batch size on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD).
We show that performing gradient updates with large batches minimizes the training time without changing the total sample complexity.
We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs).
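A minimal sketch of the setting summarized above, under illustrative assumptions (toy target, MSE loss, small two-layer network): in one-pass SGD, a larger batch size means proportionally fewer iterations at a fixed total sample budget.
```python
# Minimal sketch (assumptions, not the paper's code): one-pass SGD on a two-layer
# network, where iteration count and batch size trade off at fixed sample complexity
# n_samples = n_steps * batch_size.
import torch
import torch.nn as nn
import torch.nn.functional as F

def one_pass_sgd(batch_size, n_samples=4096, d=32, width=64, lr=0.1):
    net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    n_steps = n_samples // batch_size           # larger batches -> fewer iterations
    for _ in range(n_steps):
        x = torch.randn(batch_size, d)          # fresh data each step: one-pass / online SGD
        y = torch.sign(x[:, 0] * x[:, 1]).unsqueeze(1)   # hypothetical toy target
        loss = F.mse_loss(net(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return n_steps

# Same sample budget, different numbers of iterations ("training time").
for B in (1, 16, 256):
    print(f"batch={B:4d} -> iterations={one_pass_sgd(B)}")
```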
arXiv Detail & Related papers (2024-06-04T09:44:49Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit [36.17720004582283]
This work explores hidden progress in gradient-based optimization through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
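A minimal sketch of the $k$-sparse parity task referenced above (an assumption about the setup, not the paper's code): each label is the product of a fixed hidden subset of $k$ of the $n$ $\pm1$ input bits.
```python
# Minimal sketch (illustrative, not the paper's code) of k-sparse parity data:
# the label is the parity (product of +/-1 values, i.e. XOR) of k hidden coordinates.
import torch

def sparse_parity_data(n_samples, n=50, k=3, seed=0):
    g = torch.Generator().manual_seed(seed)
    support = torch.randperm(n, generator=g)[:k]                          # hidden relevant bits
    x = 2 * torch.randint(0, 2, (n_samples, n), generator=g).float() - 1  # +/-1 inputs
    y = x[:, support].prod(dim=1)                                         # parity of the k bits
    return x, y, support

x, y, support = sparse_parity_data(8)
print("relevant bits:", support.tolist(), "labels:", y.tolist())
```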
arXiv Detail & Related papers (2022-07-18T17:55:05Z)
- Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z)
- Dynamics of Local Elasticity During Training of Neural Nets [7.9140338281956835]
"Local elasticity" attempts to quantify the propagation of the influence of a sampled data point on the prediction at another data.
We show that our new proposal of $S_{\rm rel}$, as opposed to the original definition, much more sharply detects the property of the weight updates.
arXiv Detail & Related papers (2021-11-01T18:00:14Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
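A minimal sketch of the label-noise mechanism referenced above (illustrative assumptions only: toy regression data, Gaussian noise, MSE loss): the labels are re-perturbed at every SGD step. The paper's result is that, near the zero-loss manifold, this is equivalent to minimizing $L(\theta)+\lambda R(\theta)$ with $R$ penalizing large Hessian eigenvalues; the code below only shows the noise-injection mechanics, not that equivalence.
```python
# Minimal sketch (not the paper's analysis): "label noise SGD" re-samples a small
# perturbation of the labels at every optimization step.
import torch
import torch.nn as nn
import torch.nn.functional as F

def label_noise_sgd_step(net, opt, x, y, noise_std=0.1):
    y_noisy = y + noise_std * torch.randn_like(y)   # fresh label noise each iteration
    loss = F.mse_loss(net(x), y_noisy)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage on a tiny over-parametrized regression problem.
net = nn.Sequential(nn.Linear(5, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)
x, y = torch.randn(20, 5), torch.randn(20, 1)
for _ in range(200):
    label_noise_sgd_step(net, opt, x, y)
```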
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- Improved generalization by noise enhancement [5.33024001730262]
Noise in stochastic gradient descent (SGD) is closely related to generalization.
We propose a method, ``noise enhancement'', that amplifies the SGD noise to improve generalization.
It turns out that large-batch training with the noise enhancement even shows better generalization compared with small-batch training.
arXiv Detail & Related papers (2020-09-28T06:29:23Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.