On the Generalization of Stochastic Gradient Descent with Momentum
- URL: http://arxiv.org/abs/2102.13653v1
- Date: Fri, 26 Feb 2021 18:58:29 GMT
- Title: On the Generalization of Stochastic Gradient Descent with Momentum
- Authors: Ali Ramezani-Kebrya, Ashish Khisti, Ben Liang
- Abstract summary: We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While momentum-based methods, in conjunction with stochastic gradient descent
(SGD), are widely used when training machine learning models, there is little
theoretical understanding on the generalization error of such methods. In this
work, we first show that there exists a convex loss function for which
algorithmic stability fails to establish generalization guarantees when SGD
with standard heavy-ball momentum (SGDM) is run for multiple epochs. Then, for
smooth Lipschitz loss functions, we analyze a modified momentum-based update
rule, i.e., SGD with early momentum (SGDEM), and show that it admits an
upper-bound on the generalization error. Thus, our results show that machine
learning models can be trained for multiple epochs of SGDEM with a guarantee
for generalization. Finally, for the special case of strongly convex loss
functions, we find a range of momentum such that multiple epochs of standard
SGDM, as a special form of SGDEM, also generalizes. Extending our results on
generalization, we also develop an upper-bound on the expected true risk, in
terms of the number of training steps, the size of the training set, and the
momentum parameter. Experimental evaluations verify the consistency between the
numerical results and our theoretical bounds and the effectiveness of SGDEM for
smooth Lipschitz loss functions.
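The update rules discussed above can be sketched in a few lines. Below is a minimal, illustrative implementation of heavy-ball momentum (SGDM) together with the early-momentum schedule (SGDEM), in which momentum is applied only during an initial phase and then switched off. The loss, learning rate, momentum value, and switch-off step `t0` here are arbitrary illustrative choices, not values taken from the paper.

```python
import numpy as np

def heavy_ball_step(w, v, grad, lr, mu):
    """One heavy-ball (SGDM) update: v <- mu*v - lr*grad; w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

# Illustrative run on the strongly convex loss f(w) = 0.5*||w||^2,
# whose gradient is w, with small additive gradient noise to mimic SGD.
# The early-momentum schedule (momentum only for the first t0 steps,
# zero afterwards) follows the SGDEM idea described in the abstract.
rng = np.random.default_rng(0)
w, v = rng.normal(size=5), np.zeros(5)
t0, lr, mu = 20, 0.1, 0.9
for t in range(100):
    grad = w + 0.01 * rng.normal(size=5)  # stochastic gradient of f
    momentum = mu if t < t0 else 0.0      # SGDEM: momentum vanishes after t0
    w, v = heavy_ball_step(w, v, grad, lr, momentum)
print(np.linalg.norm(w))  # iterates approach the minimizer at the origin
```

Setting `t0` to the total number of steps recovers standard SGDM, which the abstract treats as a special case of SGDEM.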
Related papers
- Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression [4.150180443030652]
We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm.
The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer.
arXiv Detail & Related papers (2024-10-03T16:13:42Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks [27.54155197562196]
We show that the trajectories of stochastic gradient descent (SGD) can be well-approximated by a Feller process.
We propose a "capacity metric" to measure the success of such generalizations.
arXiv Detail & Related papers (2020-06-16T16:57:12Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)