Benign Oscillation of Stochastic Gradient Descent with Large Learning
Rates
- URL: http://arxiv.org/abs/2310.17074v1
- Date: Thu, 26 Oct 2023 00:35:40 GMT
- Title: Benign Oscillation of Stochastic Gradient Descent with Large Learning
Rates
- Authors: Miao Lu, Beining Wu, Xiaodong Yang, Difan Zou
- Abstract summary: We investigate the generalization properties of neural networks (NN) trained by the stochastic gradient descent (SGD) algorithm with large learning rates.
Under such a training regime, our finding is that the oscillation of the NN weights caused by large learning rate SGD training turns out to be beneficial to the generalization of the NN.
- Score: 21.8377731053374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we theoretically investigate the generalization properties of
neural networks (NN) trained by stochastic gradient descent (SGD) algorithm
with large learning rates. Under such a training regime, our finding is that
the oscillation of the NN weights caused by the large learning rate SGD
training turns out to be beneficial to the generalization of the NN, which
potentially improves over the same NN trained by SGD with small learning rates
that converges more smoothly. In view of this finding, we call such a
phenomenon "benign oscillation". Our theory towards demystifying such a
phenomenon builds upon the feature learning perspective of deep learning.
Specifically, we consider a feature-noise data generation model that consists
of (i) weak features which have a small $\ell_2$-norm and appear in each data
point; (ii) strong features which have a larger $\ell_2$-norm but only appear
in a certain fraction of all data points; and (iii) noise. We prove that NNs
trained by oscillating SGD with a large learning rate can effectively learn the
weak features in the presence of those strong features. In contrast, NNs
trained by SGD with a small learning rate can only learn the strong features
but make little progress in learning the weak features. Consequently, when it
comes to the new testing data which consist of only weak features, the NN
trained by oscillating SGD with a large learning rate could still make correct
predictions consistently, while the NN trained by small learning rate SGD
fails. Our theory sheds light on how large learning rate training benefits the
generalization of NNs. Experimental results demonstrate our finding on "benign
oscillation".
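The feature-noise data model described in the abstract can be sketched concretely. The dimensions, feature norms, noise scale, and strong-feature fraction below are hypothetical illustrative choices, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 50, 200        # ambient dimension and sample count (hypothetical choices)
p_strong = 0.5        # fraction of training points carrying the strong feature

# Two orthogonal feature directions; the strong feature has the larger l2-norm.
v_weak = np.zeros(d)
v_weak[0] = 0.5       # (i) weak feature: small norm, present in every data point
v_strong = np.zeros(d)
v_strong[1] = 2.0     # (ii) strong feature: larger norm, present in a fraction

y = rng.choice([-1.0, 1.0], size=n)       # binary labels
has_strong = rng.random(n) < p_strong     # points that also carry the strong feature

# (iii) additive Gaussian noise on top of the label-aligned features
X = y[:, None] * v_weak + 0.1 * rng.standard_normal((n, d))
X[has_strong] += y[has_strong, None] * v_strong

# Test data in this setting contains only the weak feature plus noise, so a
# network that never learned v_weak cannot classify it.
X_test = y[:, None] * v_weak + 0.1 * rng.standard_normal((n, d))

print(X.shape, np.linalg.norm(v_weak), np.linalg.norm(v_strong))
```

Because every test point carries only the weak feature, this construction separates the two regimes the paper compares: a network that latched onto the strong feature alone gets no usable signal at test time, while one that also learned the weak feature still predicts the label.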
Related papers
- Label Deconvolution for Node Representation Learning on Large-scale
Attributed Graphs against Learning Bias [75.44877675117749]
We propose an efficient label regularization technique, namely Label Deconvolution (LD), to alleviate the learning bias by a novel and highly scalable approximation to the inverse mapping of GNNs.
Experiments demonstrate LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.
arXiv Detail & Related papers (2023-09-26T13:09:43Z) - Graph Neural Networks Provably Benefit from Structural Information: A
Feature Learning Perspective [53.999128831324576]
Graph neural networks (GNNs) have pioneered advancements in graph representation learning.
This study investigates the role of graph convolution within the context of feature learning theory.
arXiv Detail & Related papers (2023-06-24T10:21:11Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Evolutionary Algorithms in the Light of SGD: Limit Equivalence, Minima
Flatness, and Transfer Learning [7.262048441360132]
We show that a class of evolutionary algorithms (EAs) inspired by the Gillespie-Orr Mutational Landscapes model for natural evolution is formally equivalent to Stochastic Gradient Descent (SGD).
We then show that for ANNs trained to near-optimality or in the transfer learning setting, the equivalence also allows transferring the insights from the Mutational Landscapes model to SGD.
arXiv Detail & Related papers (2023-05-20T22:26:44Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using stochastic gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later during training.
We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness? [0.0]
We study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods.
We show how NTKs allow to generate adversarial examples in a "training-free" fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the "lazy" regime.
arXiv Detail & Related papers (2022-10-11T16:11:48Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - Linear Frequency Principle Model to Understand the Absence of
Overfitting in Neural Networks [4.86119220344659]
We show that low frequency dominance of target functions is the key condition for the non-overfitting of NNs.
Through an ideal two-layer NN, we unravel how detailed microscopic NN training dynamics statistically gives rise to a LFP model with quantitative prediction power.
arXiv Detail & Related papers (2021-01-30T10:11:37Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.