Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning
- URL: http://arxiv.org/abs/2301.13703v2
- Date: Tue, 30 May 2023 12:21:35 GMT
- Title: Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning
- Authors: Antonio Sclocchi, Mario Geiger, Matthieu Wyart
- Abstract summary: Noise in gradient descent affects generalization of deep neural networks.
We show that SGD noise can be detrimental or instead useful depending on the training regime.
- Score: 3.0222726254970174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding when the noise in stochastic gradient descent (SGD) affects
generalization of deep neural networks remains a challenge, complicated by the
fact that networks can operate in distinct training regimes. Here we study how
the magnitude of this noise $T$ affects performance as the size of the training
set $P$ and the scale of initialization $\alpha$ are varied. For gradient
descent, $\alpha$ is a key parameter that controls if the network is
`lazy' ($\alpha\gg1$) or instead learns features ($\alpha\ll1$). For
classification of MNIST and CIFAR10 images, our central results are: (i)
obtaining phase diagrams for performance in the $(\alpha,T)$ plane. They show
that SGD noise can be detrimental or instead useful depending on the training
regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the
net to escape the lazy regime, these changes can have opposite effects on
performance. (ii) Most importantly, we find that the characteristic temperature
$T_c$ where the noise of SGD starts affecting the trained model (and eventually
performance) is a power law of $P$. We relate this finding with the observation
that key dynamical quantities, such as the total variation of weights during
training, depend on both $T$ and $P$ as power laws. These results indicate that
a key effect of SGD noise occurs late in training by affecting the stopping
process whereby all data are fitted. Indeed, we argue that due to SGD noise,
nets must develop a stronger `signal', i.e. larger informative weights, to fit
the data, leading to a longer training time. A stronger signal and a longer
training time are also required when the size of the training set $P$
increases. We confirm these views in the perceptron model, where signal and
noise can be precisely measured. Interestingly, exponents characterizing the
effect of SGD depend on the density of data near the decision boundary, as we
explain.
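The sketch below (not the authors' code) illustrates the two control parameters of the study: the output scale $\alpha$, implemented with the common centering convention $f_\alpha(x)=\alpha\,(f(x;w)-f(x;w_0))$ that interpolates between the lazy ($\alpha\gg1$) and feature-learning ($\alpha\ll1$) regimes, and a crude proxy for the SGD noise magnitude $T$ taken here as learning rate over batch size. The paper's exact loss-scaling conventions and definition of $T$ may differ; the model sizes, learning rate, and loss are illustrative assumptions.
```python
# Minimal sketch (not the authors' code) of the two knobs varied in the paper:
# the initialization/output scale alpha and a proxy for the SGD noise magnitude T.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_alpha_model(alpha, in_dim=784, width=256, n_classes=10):
    net = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(), nn.Linear(width, n_classes))
    net0 = copy.deepcopy(net)              # frozen copy of the network at initialization
    for p in net0.parameters():
        p.requires_grad_(False)

    def f_alpha(x):
        # f_alpha(x) = alpha * (f(x; w) - f(x; w0)).
        # Large alpha: output dominated by the linearization around w0 (lazy regime).
        # Small alpha: weights must move far from w0 to fit the data (feature learning).
        return alpha * (net(x) - net0(x))

    return net, f_alpha

def sgd_temperature(lr, batch_size):
    # Illustrative proxy for the SGD noise scale T; the paper defines T precisely.
    return lr / batch_size

# Usage: one SGD step at a given point of the (alpha, T) plane on dummy MNIST-sized data.
alpha, lr, B = 100.0, 0.1, 32
net, f_alpha = make_alpha_model(alpha)
opt = torch.optim.SGD(net.parameters(), lr=lr)
x, y = torch.randn(B, 784), torch.randint(0, 10, (B,))
loss = F.cross_entropy(f_alpha(x), y)
opt.zero_grad(); loss.backward(); opt.step()
print(f"alpha={alpha}, T~{sgd_temperature(lr, B):.4f}, loss={loss.item():.3f}")
```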
Related papers
- The Optimization Landscape of SGD Across the Feature Learning Strength [102.1353410293931]
We study the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting.
We find that optimal online performance is often achieved at large $\gamma$.
Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
arXiv Detail & Related papers (2024-10-06T22:30:14Z)
- Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs [24.305423716384272]
We study the impact of the batch size on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD).
We show that performing gradient updates with large batches minimizes the training time without changing the total sample complexity.
We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs).
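A minimal sketch of the setting summarized above, under illustrative assumptions (toy target, MSE loss, small two-layer network): in one-pass SGD, a larger batch size means proportionally fewer iterations at a fixed total sample budget.
```python
# Minimal sketch (assumptions, not the paper's code): one-pass SGD on a two-layer
# network, where iteration count and batch size trade off at fixed sample complexity
# n_samples = n_steps * batch_size.
import torch
import torch.nn as nn
import torch.nn.functional as F

def one_pass_sgd(batch_size, n_samples=4096, d=32, width=64, lr=0.1):
    net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    n_steps = n_samples // batch_size           # larger batches -> fewer iterations
    for _ in range(n_steps):
        x = torch.randn(batch_size, d)          # fresh data each step: one-pass / online SGD
        y = torch.sign(x[:, 0] * x[:, 1]).unsqueeze(1)   # hypothetical toy target
        loss = F.mse_loss(net(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return n_steps

# Same sample budget, different numbers of iterations ("training time").
for B in (1, 16, 256):
    print(f"batch={B:4d} -> iterations={one_pass_sgd(B)}")
```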
arXiv Detail & Related papers (2024-06-04T09:44:49Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit [36.17720004582283]
This work explores hidden progress in gradient-based optimization through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
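A minimal sketch of the $k$-sparse parity task referenced above (an assumption about the setup, not the paper's code): each label is the product of a fixed hidden subset of $k$ of the $n$ $\pm1$ input bits.
```python
# Minimal sketch (illustrative, not the paper's code) of k-sparse parity data:
# the label is the parity (product of +/-1 values, i.e. XOR) of k hidden coordinates.
import torch

def sparse_parity_data(n_samples, n=50, k=3, seed=0):
    g = torch.Generator().manual_seed(seed)
    support = torch.randperm(n, generator=g)[:k]                          # hidden relevant bits
    x = 2 * torch.randint(0, 2, (n_samples, n), generator=g).float() - 1  # +/-1 inputs
    y = x[:, support].prod(dim=1)                                         # parity of the k bits
    return x, y, support

x, y, support = sparse_parity_data(8)
print("relevant bits:", support.tolist(), "labels:", y.tolist())
```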
arXiv Detail & Related papers (2022-07-18T17:55:05Z)
- Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z)
- Dynamics of Local Elasticity During Training of Neural Nets [7.9140338281956835]
"Local elasticity" attempts to quantify the propagation of the influence of a sampled data point on the prediction at another data.
We show that our new proposal of $S_{\rm rel}$, as opposed to the original definition, much more sharply detects the property of the weight updates.
arXiv Detail & Related papers (2021-11-01T18:00:14Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
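A minimal sketch of the label-noise mechanism referenced above (illustrative assumptions only: toy regression data, Gaussian noise, MSE loss): the labels are re-perturbed at every SGD step. The paper's result is that, near the zero-loss manifold, this is equivalent to minimizing $L(\theta)+\lambda R(\theta)$ with $R$ penalizing large Hessian eigenvalues; the code below only shows the noise-injection mechanics, not that equivalence.
```python
# Minimal sketch (not the paper's analysis): "label noise SGD" re-samples a small
# perturbation of the labels at every optimization step.
import torch
import torch.nn as nn
import torch.nn.functional as F

def label_noise_sgd_step(net, opt, x, y, noise_std=0.1):
    y_noisy = y + noise_std * torch.randn_like(y)   # fresh label noise each iteration
    loss = F.mse_loss(net(x), y_noisy)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage on a tiny over-parametrized regression problem.
net = nn.Sequential(nn.Linear(5, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)
x, y = torch.randn(20, 5), torch.randn(20, 1)
for _ in range(200):
    label_noise_sgd_step(net, opt, x, y)
```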
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- Improved generalization by noise enhancement [5.33024001730262]
Noise in stochastic gradient descent (SGD) is closely related to generalization.
We propose a method, ``noise enhancement'', that amplifies the SGD noise to improve generalization.
It turns out that large-batch training with the noise enhancement even shows better generalization compared with small-batch training.
arXiv Detail & Related papers (2020-09-28T06:29:23Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.