On the Generalization of Stochastic Gradient Descent with Momentum
- URL: http://arxiv.org/abs/1809.04564v3
- Date: Mon, 15 Jan 2024 14:55:29 GMT
- Title: On the Generalization of Stochastic Gradient Descent with Momentum
- Authors: Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish
Khisti, Ben Liang
- Abstract summary: Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
- Score: 84.54924994010703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While momentum-based accelerated variants of stochastic gradient descent
(SGD) are widely used when training machine learning models, there is little
theoretical understanding of the generalization error of such methods. In this
work, we first show that there exists a convex loss function for which the
stability gap for multiple epochs of SGD with standard heavy-ball momentum
(SGDM) becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze
a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM)
under a broad range of step-sizes, and show that it can train machine learning
models for multiple epochs with a guarantee for generalization. Finally, for
the special case of strongly convex loss functions, we find a range of momentum
such that multiple epochs of standard SGDM, as a special form of SGDEM, also
generalizes. Extending our results on generalization, we also develop an upper
bound on the expected true risk, in terms of the number of training steps,
sample size, and momentum. Our experimental evaluations verify the consistency
between the numerical results and our theoretical bounds. SGDEM improves the
generalization error of SGDM when training ResNet-18 on ImageNet in practical
distributed settings.
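The abstract contrasts standard heavy-ball momentum (SGDM) with the modified "early momentum" rule (SGDEM). The sketch below is only an illustration of that contrast: it assumes SGDEM means applying the heavy-ball term for an initial fraction of the updates and plain SGD afterwards, with a single constant step-size; the paper's actual schedules, switch point, and constants may differ, and the function and parameter names (sgd_heavy_ball, momentum_frac, grad_fn) are ours, not the paper's.

```python
import numpy as np

def sgd_heavy_ball(grad_fn, w0, steps, lr=0.01, beta=0.9, momentum_frac=1.0, rng=None):
    """Heavy-ball SGD. With momentum_frac=1.0 this is standard SGDM:
        w_{t+1} = w_t - lr * g_t + beta * (w_t - w_{t-1}).
    With momentum_frac < 1.0 it is an 'early momentum' variant (SGDEM-style
    illustration): momentum is applied only for the first momentum_frac * steps
    updates, after which the rule reduces to plain SGD."""
    rng = np.random.default_rng() if rng is None else rng
    w, w_prev = np.array(w0, dtype=float), np.array(w0, dtype=float)
    switch = int(momentum_frac * steps)
    for t in range(steps):
        g = grad_fn(w, rng)              # stochastic gradient at the current iterate
        b = beta if t < switch else 0.0  # turn momentum off after the early phase
        w, w_prev = w - lr * g + b * (w - w_prev), w
    return w

# Toy usage: least-squares loss with single-sample (stochastic) gradients.
A = np.random.default_rng(0).normal(size=(200, 5))
y = A @ np.ones(5) + 0.1 * np.random.default_rng(1).normal(size=200)

def grad_fn(w, rng):
    i = rng.integers(len(y))             # sample one data point
    return (A[i] @ w - y[i]) * A[i]      # gradient of 0.5 * (a_i^T w - y_i)^2

w_sgdm  = sgd_heavy_ball(grad_fn, np.zeros(5), steps=5000)                     # standard SGDM
w_sgdem = sgd_heavy_ball(grad_fn, np.zeros(5), steps=5000, momentum_frac=0.2)  # early momentum only
```

In this toy least-squares setup both calls return a parameter vector; the point is only to make the difference between the two update rules concrete.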
Related papers
- Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Momentum via Primal Averaging: Theoretical Insights and Learning Rate
Schedules for Non-Convex Optimization [10.660480034605241]
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form; a schematic of the two update forms is sketched after this list.
arXiv Detail & Related papers (2020-10-01T13:46:32Z) - Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks [27.54155197562196]
We show that the trajectories of stochastic gradient descent (SGD) can be well approximated by a Feller process.
We propose a "capacity metric" to measure the success of such generalizations.
arXiv Detail & Related papers (2020-06-16T16:57:12Z)
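For the primal-averaging entry above ("Momentum via Primal Averaging"), the following is a schematic of the two update forms it refers to: heavy-ball SGD with momentum and a primal-averaging recursion. It is a generic template in our own notation (step-sizes eta_k and alpha_k, momentum beta_k, averaging weights c_{k+1}); the parameter correspondence that makes the two forms equivalent is worked out in that paper and is not reproduced here.

```latex
% Heavy-ball SGD with momentum (SGD+M) versus a stochastic primal averaging
% (SPA) style recursion; g_k denotes the stochastic gradient evaluated at x_k.
\begin{align*}
  \text{SGD+M:}\quad & x_{k+1} = x_k - \eta_k\, g_k + \beta_k\,(x_k - x_{k-1}),\\[2pt]
  \text{SPA:}\quad   & z_{k+1} = z_k - \alpha_k\, g_k, \qquad
                       x_{k+1} = (1 - c_{k+1})\, x_k + c_{k+1}\, z_{k+1}.
\end{align*}
```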
This list is automatically generated from the titles and abstracts of the papers on this site.