SGD with Large Step Sizes Learns Sparse Features
- URL: http://arxiv.org/abs/2210.05337v2
- Date: Wed, 7 Jun 2023 09:50:21 GMT
- Title: SGD with Large Step Sizes Learns Sparse Features
- Authors: Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas
Flammarion
- Abstract summary: We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
- Score: 22.959258640051342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We showcase important features of the dynamics of the Stochastic Gradient
Descent (SGD) in the training of neural networks. We present empirical
observations that commonly used large step sizes (i) lead the iterates to jump
from one side of a valley to the other causing loss stabilization, and (ii)
this stabilization induces a hidden stochastic dynamics orthogonal to the
bouncing directions that biases it implicitly toward sparse predictors.
Furthermore, we show empirically that the longer large step sizes keep SGD high
in the loss landscape valleys, the better the implicit regularization can
operate and find sparse representations. Notably, no explicit regularization is
used so that the regularization effect comes solely from the SGD training
dynamics influenced by the step size schedule. Therefore, these observations
unveil how, through the step size schedules, both gradient and noise drive
together the SGD dynamics through the loss landscape of neural networks. We
justify these findings theoretically through the study of simple neural network
models as well as qualitative arguments inspired from stochastic processes.
Finally, this analysis allows us to shed a new light on some common practice
and observed phenomena when training neural networks. The code of our
experiments is available at https://github.com/tml-epfl/sgd-sparse-features.
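To make the claimed mechanism concrete, here is a minimal, self-contained sketch. It is not the authors' code (their experiments live in the repository linked above): it trains a two-layer diagonal linear network, a simple model of the kind the abstract alludes to, with SGD under a "large step size, then decay" schedule and reports how sparse the learned predictor w = u * v ends up. All problem sizes, step sizes, and phase lengths are illustrative assumptions and may need tuning to actually reach the loss-stabilization regime described above.

```python
# Illustrative sketch only (not the authors' code). SGD on a two-layer diagonal
# linear network w = u * v for sparse regression, with a "large step size, then
# decay" schedule. All sizes and step sizes below are assumptions chosen for
# illustration and may need tuning to reach the loss-stabilization regime.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 100, 3                       # samples, input dimension, true sparsity
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:k] = 1.0                           # sparse ground-truth predictor
y = X @ w_star

def train(lr_large, n_large, lr_small=0.01, n_small=10_000):
    """SGD on 0.5 * (<u*v, x> - y)^2 with a two-phase step-size schedule."""
    u = np.full(d, 0.1)
    v = np.full(d, 0.1)
    schedule = [lr_large] * n_large + [lr_small] * n_small
    for lr in schedule:
        i = rng.integers(n)                # single-sample mini-batch
        r = X[i] @ (u * v) - y[i]          # residual on the sampled example
        g = r * X[i]                       # gradient of the loss w.r.t. w = u * v
        u, v = u - lr * g * v, v - lr * g * u   # chain rule onto u and v
    return u * v

for lr_large, n_large in [(0.0, 0), (0.1, 5_000)]:
    w = train(lr_large, n_large)
    print(f"large-lr phase: lr={lr_large}, steps={n_large} | "
          f"dist to w*: {np.linalg.norm(w - w_star):.3f} | "
          f"near-zero coords (|w_j| < 1e-2): {int(np.sum(np.abs(w) < 1e-2))}/{d}")
```

Comparing the two printed runs (no large-step phase versus a long one) is the intended probe of the claim that longer large-step phases leave the predictor sparser.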
Related papers
- Benign Oscillation of Stochastic Gradient Descent with Large Learning
Rates [21.8377731053374]
We investigate the generalization properties of neural networks (NN) trained by the stochastic gradient descent (SGD) algorithm with large learning rates.
Under such a training regime, our finding is that the oscillation of the NN weights caused by large-learning-rate SGD training turns out to be beneficial to the generalization of the NN.
arXiv Detail & Related papers (2023-10-26T00:35:40Z) - Law of Balance and Stationary Distribution of Stochastic Gradient
Descent [11.937085301750288]
We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
We then derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
arXiv Detail & Related papers (2023-08-13T03:13:03Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs run into training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process (a minimal sketch of a generic implicit SGD step appears after this list).
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations,
and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
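The ISGD entry summarized above motivates a brief illustration of what an implicit SGD step is. Here is a minimal sketch of the generic implicit (proximal) SGD update such methods build on, shown on single-sample least squares where the implicit equation has a closed form; this is not the PINN paper's specific method, and the dimensions and step size are illustrative assumptions.

```python
# Minimal sketch of a generic implicit (proximal) SGD step, i.e. solving
#   theta_next = theta - lr * grad(loss_j)(theta_next),
# shown on single-sample least squares, where the implicit equation has a
# closed-form solution. This is NOT the PINN paper's exact method, only the
# generic update it builds on; for nonlinear models the implicit equation is
# solved approximately with an inner solver.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 20
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star

lr = 1.0   # far above the explicit stability limit 2 / ||x||^2 (~0.1 here)

def explicit_step(theta, x, t):
    r = x @ theta - t                                   # residual on this sample
    return theta - lr * r * x                           # unstable when lr * ||x||^2 > 2

def implicit_step(theta, x, t):
    r = x @ theta - t
    # Closed form of theta+ = theta - lr * (x @ theta+ - t) * x:
    return theta - (lr * r / (1.0 + lr * (x @ x))) * x  # contracts for any lr > 0

theta_i = np.zeros(d)
for _ in range(2000):                                   # implicit SGD stays stable
    j = rng.integers(n)
    theta_i = implicit_step(theta_i, X[j], y[j])
print("implicit SGD error:", np.linalg.norm(theta_i - theta_star))

theta_e = np.zeros(d)
for _ in range(30):                                     # explicit SGD blows up quickly
    j = rng.integers(n)
    theta_e = explicit_step(theta_e, X[j], y[j])
print("explicit SGD error:", np.linalg.norm(theta_e - theta_star))
```

The comparison highlights the stability property that motivates implicit updates: the implicit step contracts the sampled residual for any positive step size, whereas the explicit step is only stable when lr * ||x||^2 < 2.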
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.