Momentum via Primal Averaging: Theoretical Insights and Learning Rate
Schedules for Non-Convex Optimization
- URL: http://arxiv.org/abs/2010.00406v4
- Date: Tue, 1 Jun 2021 17:53:38 GMT
- Title: Momentum via Primal Averaging: Theoretical Insights and Learning Rate
Schedules for Non-Convex Optimization
- Authors: Aaron Defazio
- Abstract summary: Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form.
- Score: 10.660480034605241
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Momentum methods are now used pervasively within the machine learning
community for training non-convex models such as deep neural networks.
Empirically, they outperform traditional stochastic gradient descent (SGD)
approaches. In this work we develop a Lyapunov analysis of SGD with momentum
(SGD+M), by utilizing an equivalent rewriting of the method known as the
stochastic primal averaging (SPA) form. This analysis is much tighter than
previous theory in the non-convex case, and due to this we are able to give
precise insights into when SGD+M may outperform SGD, and what hyper-parameter
schedules will work and why.
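As a concrete illustration of the rewriting the abstract refers to, the sketch below checks numerically that SGD+M (in its usual heavy-ball/buffer form) and a primal-averaging-style iteration produce identical iterates on a toy quadratic when the hyper-parameters are matched as c = 1 - beta and gamma = alpha / (1 - beta). This is a minimal sketch under that assumed constant-parameter mapping; the paper's SPA form allows time-varying schedules and a full stochastic analysis.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T A x, with a deterministic gradient oracle so that
# the two iterate sequences can be compared exactly.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

alpha, beta = 0.02, 0.9           # SGD+M step size and momentum
c = 1.0 - beta                    # assumed averaging weight for the SPA-style form
gamma = alpha / (1.0 - beta)      # assumed step size for the SPA-style form

x0 = np.array([1.0, -1.0])

# SGD with momentum (heavy-ball buffer form):
#   m_{t+1} = beta * m_t + g_t,   x_{t+1} = x_t - alpha * m_{t+1}
x_m, m = x0.copy(), np.zeros_like(x0)
iterates_momentum = [x_m.copy()]
for _ in range(50):
    m = beta * m + grad(x_m)
    x_m = x_m - alpha * m
    iterates_momentum.append(x_m.copy())

# Primal-averaging-style form:
#   z_{t+1} = z_t - gamma * g_t,   x_{t+1} = (1 - c) * x_t + c * z_{t+1}
x_a, z = x0.copy(), x0.copy()
iterates_averaging = [x_a.copy()]
for _ in range(50):
    z = z - gamma * grad(x_a)
    x_a = (1.0 - c) * x_a + c * z
    iterates_averaging.append(x_a.copy())

# Maximum deviation between the two trajectories: on the order of machine precision.
print(np.max(np.abs(np.array(iterates_momentum) - np.array(iterates_averaging))))
```

The printed deviation is at floating-point round-off level, which is the sense in which the two forms are equivalent rewritings of the same method.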
Related papers
- NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
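The summary does not spell out the NAG-GS update itself; as background, here is a minimal sketch of the classical Nesterov accelerated gradient (NAG) iteration the method's name refers to. The semi-implicit, Gauss-Seidel-type discretization specific to NAG-GS is given in the paper and is not reproduced here; the hyper-parameter values below are illustrative only.

```python
import numpy as np

# Classical Nesterov accelerated gradient (constant-momentum form) on a toy quadratic.
# Shown only as the baseline iteration behind the name "NAG"; the NAG-GS scheme itself
# uses a different, semi-implicit Gauss-Seidel-type update.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x            # gradient of f(x) = 0.5 * x^T A x

alpha, beta = 0.05, 0.9           # illustrative step size and momentum
x = np.array([1.0, -1.0])
x_prev = x.copy()

for _ in range(200):
    y = x + beta * (x - x_prev)   # look-ahead (extrapolation) point
    x_prev = x
    x = y - alpha * grad(y)       # gradient step taken at the look-ahead point

print(np.linalg.norm(x))          # close to 0, the minimizer of the quadratic
```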
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - On the Hyperparameters in Stochastic Gradient Descent with Momentum [6.396288020763144]
We present a theoretical analysis for stochastic gradient descent with momentum (SGD with momentum) in this paper.
By introducing a surrogate learning rate, we show why the optimal linear rate for SGD with momentum depends on how the surrogate learning rate varies as the momentum increases from zero to one.
Finally, we show that the surrogate momentum under this rate has no essential difference from the standard momentum.
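One common way to make the notion of a surrogate learning rate concrete (an interpretation for illustration, not necessarily the paper's exact definition) is that the buffer update m_t = beta * m_{t-1} + g_t amplifies a persistent gradient by a factor of 1/(1 - beta), so alpha/(1 - beta) acts as the effective step size of SGD with momentum. The toy check below illustrates this.

```python
# A minimal numeric check of the effective-step-size view: with the buffer update
# m_t = beta * m_{t-1} + g and a constant gradient g, the per-step movement
# alpha * m_t approaches alpha * g / (1 - beta).
alpha, beta, g = 0.1, 0.9, 1.0
m = 0.0
for _ in range(200):
    m = beta * m + g
print(alpha * m)                 # ~1.0, approaching alpha * g / (1 - beta)
print(alpha * g / (1 - beta))    # 1.0
```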
arXiv Detail & Related papers (2021-08-09T11:25:03Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works have attempted to artificially simulate SGN by injecting random noise in order to improve deep learning.
To simulate SGN at low computational cost and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
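For context, the sketch below shows the noise-injection baseline mentioned in the summary: mimicking stochastic gradient noise by adding Gaussian noise to each gradient. This is explicitly not the PNM update, which instead combines two momentum statistics with a positive and a negative coefficient; the noise scale and step size here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the noise-injection baseline: mimic stochastic gradient noise (SGN)
# by adding Gaussian noise to each gradient. This is NOT the PNM update.
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x                    # gradient of f(x) = 0.5 * x^T A x

lr, sigma = 0.02, 0.1                     # illustrative step size and injected-noise scale
x = np.array([1.0, -1.0])
for _ in range(500):
    g = grad(x) + sigma * rng.standard_normal(x.shape)   # gradient plus injected noise
    x = x - lr * g

print(np.linalg.norm(x))                  # settles in a small noise ball around the minimizer
```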
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Reconciling Modern Deep Learning with Traditional Optimization Analyses:
The Intrinsic Learning Rate [36.83448475700536]
Recent works suggest that the use of Batch Normalization in today's deep learning can move it far from a traditional optimization viewpoint.
This paper highlights other ways in which behavior of normalized nets departs from traditional viewpoints.
We name it the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
arXiv Detail & Related papers (2020-10-06T17:58:29Z) - A High Probability Analysis of Adaptive SGD with Momentum [22.9530287983179]
Stochastic Gradient Descent (SGD) and its variants are the most used algorithms in machine learning applications.
We show for the first time a high probability convergence of the gradients to zero in the smooth non-convex setting for Delayed AdaGrad with momentum.
arXiv Detail & Related papers (2020-07-28T15:06:22Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
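Reading "SGD with early momentum" literally, a plausible sketch (an assumption for illustration; the paper's SGDEM has its own momentum schedule and step-size conditions) is heavy-ball momentum applied only during an initial phase of training, after which plain SGD takes over:

```python
import numpy as np

# Illustrative "early momentum" schedule (an assumption about SGDEM's gist, not the
# paper's exact rule): heavy-ball momentum for the first `early_iters` updates, then
# plain SGD with the momentum coefficient set to zero.
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
stochastic_grad = lambda x: A @ x + 0.05 * rng.standard_normal(x.shape)

lr, beta, early_iters = 0.02, 0.9, 200    # illustrative hyper-parameters
x = np.array([1.0, -1.0])
m = np.zeros_like(x)
for t in range(1000):
    g = stochastic_grad(x)
    momentum = beta if t < early_iters else 0.0   # momentum only in the early phase
    m = momentum * m + g
    x = x - lr * m

print(np.linalg.norm(x))                  # ends up near the minimizer
```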
arXiv Detail & Related papers (2018-09-12T17:02:08Z)