On the Generalization of Stochastic Gradient Descent with Momentum
- URL: http://arxiv.org/abs/2102.13653v1
- Date: Fri, 26 Feb 2021 18:58:29 GMT
- Title: On the Generalization of Stochastic Gradient Descent with Momentum
- Authors: Ali Ramezani-Kebrya, Ashish Khisti, Ben Liang
- Abstract summary: We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes.
- Score: 58.900860437254885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While momentum-based methods, in conjunction with stochastic gradient descent
(SGD), are widely used when training machine learning models, there is little
theoretical understanding on the generalization error of such methods. In this
work, we first show that there exists a convex loss function for which
algorithmic stability fails to establish generalization guarantees when SGD
with standard heavy-ball momentum (SGDM) is run for multiple epochs. Then, for
smooth Lipschitz loss functions, we analyze a modified momentum-based update
rule, i.e., SGD with early momentum (SGDEM), and show that it admits an
upper-bound on the generalization error. Thus, our results show that machine
learning models can be trained for multiple epochs of SGDEM with a guarantee
for generalization. Finally, for the special case of strongly convex loss
functions, we find a range of momentum such that multiple epochs of standard
SGDM, as a special form of SGDEM, also generalizes. Extending our results on
generalization, we also develop an upper-bound on the expected true risk, in
terms of the number of training steps, the size of the training set, and the
momentum parameter. Experimental evaluations verify the consistency between the
numerical results and our theoretical bounds and the effectiveness of SGDEM for
smooth Lipschitz loss functions.
Related papers
- Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent may gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks [27.54155197562196]
We show that the trajectories of gradient descent (SGD) can be wellapproximated by a emphFeller process.
We propose a "capacity metric" to measure the success of such generalizations.
arXiv Detail & Related papers (2020-06-16T16:57:12Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
momentum-based accelerated variants of gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.