Momentum via Primal Averaging: Theoretical Insights and Learning Rate
Schedules for Non-Convex Optimization
- URL: http://arxiv.org/abs/2010.00406v4
- Date: Tue, 1 Jun 2021 17:53:38 GMT
- Title: Momentum via Primal Averaging: Theoretical Insights and Learning Rate
Schedules for Non-Convex Optimization
- Authors: Aaron Defazio
- Abstract summary: Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form.
- Score: 10.660480034605241
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Momentum methods are now used pervasively within the machine learning
community for training non-convex models such as deep neural networks.
Empirically, they outperform traditional stochastic gradient descent (SGD)
approaches. In this work we develop a Lyapunov analysis of SGD with momentum
(SGD+M), by utilizing an equivalent rewriting of the method known as the
stochastic primal averaging (SPA) form. This analysis is much tighter than
previous theory in the non-convex case, and due to this we are able to give
precise insights into when SGD+M may outperform SGD, and what hyper-parameter
schedules will work and why.
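As a concrete illustration of the rewriting the abstract refers to, the sketch below checks numerically that SGD+M (in its usual heavy-ball/buffer form) and a primal-averaging-style iteration produce identical iterates on a toy quadratic when the hyper-parameters are matched as c = 1 - beta and gamma = alpha / (1 - beta). This is a minimal sketch under that assumed constant-parameter mapping; the paper's SPA form allows time-varying schedules and a full stochastic analysis.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T A x, with a deterministic gradient oracle so that
# the two iterate sequences can be compared exactly.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

alpha, beta = 0.02, 0.9           # SGD+M step size and momentum
c = 1.0 - beta                    # assumed averaging weight for the SPA-style form
gamma = alpha / (1.0 - beta)      # assumed step size for the SPA-style form

x0 = np.array([1.0, -1.0])

# SGD with momentum (heavy-ball buffer form):
#   m_{t+1} = beta * m_t + g_t,   x_{t+1} = x_t - alpha * m_{t+1}
x_m, m = x0.copy(), np.zeros_like(x0)
iterates_momentum = [x_m.copy()]
for _ in range(50):
    m = beta * m + grad(x_m)
    x_m = x_m - alpha * m
    iterates_momentum.append(x_m.copy())

# Primal-averaging-style form:
#   z_{t+1} = z_t - gamma * g_t,   x_{t+1} = (1 - c) * x_t + c * z_{t+1}
x_a, z = x0.copy(), x0.copy()
iterates_averaging = [x_a.copy()]
for _ in range(50):
    z = z - gamma * grad(x_a)
    x_a = (1.0 - c) * x_a + c * z
    iterates_averaging.append(x_a.copy())

# Maximum deviation between the two trajectories: on the order of machine precision.
print(np.max(np.abs(np.array(iterates_momentum) - np.array(iterates_averaging))))
```

The printed deviation is at floating-point round-off level, which is the sense in which the two forms are equivalent rewritings of the same method.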
Related papers
- NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
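The summary does not spell out the NAG-GS update itself; as background, here is a minimal sketch of the classical Nesterov accelerated gradient (NAG) iteration the method's name refers to. The semi-implicit, Gauss-Seidel-type discretization specific to NAG-GS is given in the paper and is not reproduced here; the hyper-parameter values below are illustrative only.

```python
import numpy as np

# Classical Nesterov accelerated gradient (constant-momentum form) on a toy quadratic.
# Shown only as the baseline iteration behind the name "NAG"; the NAG-GS scheme itself
# uses a different, semi-implicit Gauss-Seidel-type update.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x            # gradient of f(x) = 0.5 * x^T A x

alpha, beta = 0.05, 0.9           # illustrative step size and momentum
x = np.array([1.0, -1.0])
x_prev = x.copy()

for _ in range(200):
    y = x + beta * (x - x_prev)   # look-ahead (extrapolation) point
    x_prev = x
    x = y - alpha * grad(y)       # gradient step taken at the look-ahead point

print(np.linalg.norm(x))          # close to 0, the minimizer of the quadratic
```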
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - On the Hyperparameters in Stochastic Gradient Descent with Momentum [6.396288020763144]
We present a theoretical analysis for stochastic gradient descent with momentum (SGD with momentum) in this paper.
By introducing a surrogate learning rate, we show why the optimal linear rate for SGD with momentum depends on how the surrogate learning rate varies as the momentum increases from zero to one.
Finally, we show that the surrogate momentum under this rate has no essential difference from the standard momentum.
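One common way to make the notion of a surrogate learning rate concrete (an interpretation for illustration, not necessarily the paper's exact definition) is that the buffer update m_t = beta * m_{t-1} + g_t amplifies a persistent gradient by a factor of 1/(1 - beta), so alpha/(1 - beta) acts as the effective step size of SGD with momentum. The toy check below illustrates this.

```python
# A minimal numeric check of the effective-step-size view: with the buffer update
# m_t = beta * m_{t-1} + g and a constant gradient g, the per-step movement
# alpha * m_t approaches alpha * g / (1 - beta).
alpha, beta, g = 0.1, 0.9, 1.0
m = 0.0
for _ in range(200):
    m = beta * m + g
print(alpha * m)                 # ~1.0, approaching alpha * g / (1 - beta)
print(alpha * g / (1 - beta))    # 1.0
```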
arXiv Detail & Related papers (2021-08-09T11:25:03Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works have attempted to artificially simulate SGN by injecting random noise in order to improve deep learning.
To simulate SGN at low computational cost and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
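For context, the sketch below shows the noise-injection baseline mentioned in the summary: mimicking stochastic gradient noise by adding Gaussian noise to each gradient. This is explicitly not the PNM update, which instead combines two momentum statistics with a positive and a negative coefficient; the noise scale and step size here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the noise-injection baseline: mimic stochastic gradient noise (SGN)
# by adding Gaussian noise to each gradient. This is NOT the PNM update.
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x                    # gradient of f(x) = 0.5 * x^T A x

lr, sigma = 0.02, 0.1                     # illustrative step size and injected-noise scale
x = np.array([1.0, -1.0])
for _ in range(500):
    g = grad(x) + sigma * rng.standard_normal(x.shape)   # gradient plus injected noise
    x = x - lr * g

print(np.linalg.norm(x))                  # settles in a small noise ball around the minimizer
```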
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Reconciling Modern Deep Learning with Traditional Optimization Analyses:
The Intrinsic Learning Rate [36.83448475700536]
Recent works suggest that the use of Batch Normalization in today's deep learning can move it far from a traditional optimization viewpoint.
This paper highlights other ways in which behavior of normalized nets departs from traditional viewpoints.
We name it the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
arXiv Detail & Related papers (2020-10-06T17:58:29Z) - A High Probability Analysis of Adaptive SGD with Momentum [22.9530287983179]
Stochastic Gradient Descent (SGD) and its variants are the most used algorithms in machine learning applications.
We show for the first time a high probability convergence of the gradients to zero in the smooth non-convex setting for Delayed AdaGrad with momentum.
arXiv Detail & Related papers (2020-07-28T15:06:22Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
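Reading "SGD with early momentum" literally, a plausible sketch (an assumption for illustration; the paper's SGDEM has its own momentum schedule and step-size conditions) is heavy-ball momentum applied only during an initial phase of training, after which plain SGD takes over:

```python
import numpy as np

# Illustrative "early momentum" schedule (an assumption about SGDEM's gist, not the
# paper's exact rule): heavy-ball momentum for the first `early_iters` updates, then
# plain SGD with the momentum coefficient set to zero.
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
stochastic_grad = lambda x: A @ x + 0.05 * rng.standard_normal(x.shape)

lr, beta, early_iters = 0.02, 0.9, 200    # illustrative hyper-parameters
x = np.array([1.0, -1.0])
m = np.zeros_like(x)
for t in range(1000):
    g = stochastic_grad(x)
    momentum = beta if t < early_iters else 0.0   # momentum only in the early phase
    m = momentum * m + g
    x = x - lr * m

print(np.linalg.norm(x))                  # ends up near the minimizer
```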
arXiv Detail & Related papers (2018-09-12T17:02:08Z)