Towards understanding how momentum improves generalization in deep
learning
- URL: http://arxiv.org/abs/2207.05931v1
- Date: Wed, 13 Jul 2022 02:39:08 GMT
- Title: Towards understanding how momentum improves generalization in deep
learning
- Authors: Samy Jelassi, Yuanzhi Li
- Abstract summary: We show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems.
The key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin.
- Score: 44.441873298005326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent (SGD) with momentum is widely used for training
modern deep learning architectures. While it is well understood that using
momentum can lead to faster convergence rates in various settings, it has also
been observed that momentum yields better generalization. Prior work argues that
momentum stabilizes the SGD noise during training and that this leads to better
generalization. In this paper, we adopt another perspective and first
empirically show that gradient descent with momentum (GD+M) significantly
improves generalization compared to gradient descent (GD) in some deep learning
problems. From this observation, we formally study how momentum improves
generalization. We devise a binary classification setting where a one-hidden
layer (over-parameterized) convolutional neural network trained with GD+M
provably generalizes better than the same network trained with GD, when both
algorithms are similarly initialized. The key insight in our analysis is that
momentum is beneficial in datasets where the examples share some feature but
differ in their margin. Unlike GD, which memorizes the small-margin data,
GD+M still learns the feature in these examples thanks to its historical gradients.
Lastly, we empirically validate our theoretical findings.
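To make the GD versus GD+M comparison concrete, here is a minimal, self-contained NumPy sketch. It is not the paper's setting (which uses an over-parameterized one-hidden-layer convolutional network with a patch structure); the helper names `make_margin_dataset` and `train` and all constants are illustrative assumptions. The snippet only shows the heavy-ball update that GD+M uses and the kind of margin-imbalanced data the analysis targets.
```python
import numpy as np

rng = np.random.default_rng(0)

def make_margin_dataset(n=200, d=50, big=5.0, small=0.5, frac_small=0.2):
    """Toy data (an assumption, not the paper's model): every example carries the
    same informative feature direction, but a fraction carries it with a small margin."""
    feature = np.zeros(d)
    feature[0] = 1.0                                   # shared feature direction
    y = rng.choice([-1.0, 1.0], size=n)
    margins = np.where(rng.random(n) < frac_small, small, big)
    X = y[:, None] * margins[:, None] * feature + 0.5 * rng.normal(size=(n, d))
    return X, y

def logistic_grad(w, X, y):
    # gradient of the mean logistic loss log(1 + exp(-y <w, x>)) for labels in {-1, +1}
    p = 1.0 / (1.0 + np.exp(np.clip(y * (X @ w), -30.0, 30.0)))   # sigmoid(-y <w, x>)
    return -(X * (y * p)[:, None]).mean(axis=0)

def train(X, y, lr=0.5, beta=0.0, steps=500):
    """Full-batch training: beta=0 gives plain GD, beta>0 gives heavy-ball GD+M."""
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    for _ in range(steps):
        g = logistic_grad(w, X, y)
        m = beta * m + g                               # momentum buffer of historical gradients
        w -= lr * m
    return w

X_tr, y_tr = make_margin_dataset()
X_te, y_te = make_margin_dataset()
for beta in (0.0, 0.9):
    w = train(X_tr, y_tr, beta=beta)
    acc = np.mean(np.sign(X_te @ w) == y_te)
    print(f"beta={beta}: test accuracy {acc:.2f}")
```
In this linear toy the generalization gap proved in the paper does not necessarily appear, since that result relies on the over-parameterized convolutional setting; the sketch only pins down the two update rules being compared and the shared-feature, mixed-margin data distribution described in the abstract.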
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z)
- Are GATs Out of Balance? [73.2500577189791]
We study the Graph Attention Network (GAT) in which a node's neighborhood aggregation is weighted by parameterized attention coefficients.
Our main theorem serves as a stepping stone to studying the learning dynamics of positive homogeneous models with attention mechanisms.
arXiv Detail & Related papers (2023-10-11T06:53:05Z)
- Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study the training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training without sacrificing generalization (see the sketch after this list).
arXiv Detail & Related papers (2022-10-28T20:41:48Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of stochastic gradient descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- On the Generalization Mystery in Deep Learning [15.2292571922932]
We argue that the answer to two questions lies in the interaction of the gradients of different examples during training.
We formalize this argument with an easy-to-compute and interpretable metric for coherence.
The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others.
arXiv Detail & Related papers (2022-03-18T16:09:53Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and consistently achieves state-of-the-art results on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
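As a side note on the "Flatter, faster" entry above, the quoted scaling rule (choose the momentum parameter so that $1-\beta$ scales with the learning rate to the power $2/3$) can be written as a one-liner; the proportionality constant `c` below is an assumed placeholder, not a value from that paper.
```python
def momentum_from_lr(lr: float, c: float = 1.0) -> float:
    # 1 - beta proportional to lr**(2/3), per the scaling rule quoted above;
    # the constant c is an assumption for illustration only
    return 1.0 - c * lr ** (2.0 / 3.0)

for lr in (0.1, 0.01, 0.001):
    print(f"lr={lr}: beta={momentum_from_lr(lr):.4f}")
```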
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.