When and Why Momentum Accelerates SGD: An Empirical Study
- URL: http://arxiv.org/abs/2306.09000v1
- Date: Thu, 15 Jun 2023 09:54:21 GMT
- Title: When and Why Momentum Accelerates SGD: An Empirical Study
- Authors: Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen,
Nanning Zheng
- Abstract summary: This study examines the performance of stochastic gradient descent (SGD) with momentum (SGDM).
We find that momentum acceleration is closely related to abrupt sharpening, a sudden jump of the directional Hessian along the update direction.
Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening.
- Score: 76.2666927020119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Momentum has become a crucial component in deep learning optimizers,
necessitating a comprehensive understanding of when and why it accelerates
stochastic gradient descent (SGD). To address the question of ``when'', we
establish a meaningful comparison framework that examines the performance of
SGD with Momentum (SGDM) under the \emph{effective learning rates} $\eta_{ef}$,
a notion unifying the influence of momentum coefficient $\mu$ and batch size
$b$ over learning rate $\eta$. In the comparison of SGDM and SGD with the same
effective learning rate and the same batch size, we observe a consistent
pattern: when $\eta_{ef}$ is small, SGDM and SGD experience almost the same
empirical training losses; when $\eta_{ef}$ surpasses a certain threshold, SGDM
begins to perform better than SGD. Furthermore, we observe that the advantage
of SGDM over SGD becomes more pronounced with a larger batch size. For the
question of ``why'', we find that the momentum acceleration is closely related
to \emph{abrupt sharpening}, which describes a sudden jump of the
directional Hessian along the update direction. Specifically, the misalignment
between SGD and SGDM happens at the same moment that SGD experiences abrupt
sharpening and converges slower. Momentum improves the performance of SGDM by
preventing or deferring the occurrence of abrupt sharpening. Together, this
study unveils the interplay between momentum, learning rates, and batch sizes,
thus improving our understanding of momentum acceleration.
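The comparison above can be sketched at toy scale. The snippet below (not the authors' code) trains the same small network with SGD and with SGDM at a matched effective learning rate and batch size, and tracks the directional Hessian along the actual update step, the quantity whose sudden jump the paper calls abrupt sharpening. The normalization $\eta_{ef} = \eta / (b(1-\mu))$, the model, the data, and all hyperparameters are illustrative assumptions, not the paper's exact protocol.
```python
# Minimal sketch (not the authors' code): SGD vs. SGDM at a matched effective
# learning rate, tracking the directional Hessian along the actual update step.
# Assumed convention: eta_ef = eta / (b * (1 - mu)); the paper's exact
# normalization may differ. Model, data, and hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 20), torch.randn(512, 1)  # synthetic regression data


def make_model():
    torch.manual_seed(1)  # identical initialization for both runs
    return nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 1))


def directional_hessian(model, loss, direction):
    """Return d^T H d / ||d||^2 via a Hessian-vector product (double backprop)."""
    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g_dot_d = sum((g * d).sum() for g, d in zip(grads, direction))
    hvp = torch.autograd.grad(g_dot_d, params)  # Hessian-vector product H d
    num = sum((h * d).sum() for h, d in zip(hvp, direction))
    den = sum((d * d).sum() for d in direction)
    return (num / den).item()


def train(mu, eta_ef, batch=64, steps=200):
    eta = eta_ef * batch * (1.0 - mu)  # recover the raw learning rate (assumed rule)
    model, loss_fn = make_model(), nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=eta, momentum=mu)
    for step in range(steps):
        idx = torch.randint(0, X.shape[0], (batch,))
        loss = loss_fn(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        prev = [p.detach().clone() for p in model.parameters()]
        opt.step()
        # The update direction is the actual parameter change made by this step.
        direction = [p.detach() - q for p, q in zip(model.parameters(), prev)]
        if step % 50 == 0:
            full_loss = loss_fn(model(X), y)
            sharp = directional_hessian(model, full_loss, direction)
            print(f"mu={mu:.1f} step={step:3d} "
                  f"full-batch loss={full_loss.item():.4f} "
                  f"directional Hessian={sharp:.2f}")


for mu in (0.0, 0.9):        # 0.0 = plain SGD, 0.9 = SGDM
    train(mu, eta_ef=0.001)  # same eta_ef and batch size for both runs
```
If the paper's observation carries over to this toy setting, raising $\eta_{ef}$ should produce a markedly larger jump in the directional Hessian for the SGD run than for the matched SGDM run.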
Related papers
- Why (and When) does Local SGD Generalize Better than SGD? [46.993699881100454]
Local SGD is a communication-efficient variant of SGD for large-scale training.
This paper aims to understand why (and when) Local SGD generalizes better, based on a Stochastic Differential Equation (SDE) approximation.
arXiv Detail & Related papers (2023-03-02T12:56:52Z) - Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z) - DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training [30.574484395380043]
Decentralized momentum SGD (DmSGD) is more communication efficient than parallel momentum SGD, which incurs a global average across all computing nodes.
We propose DecentLaM, a decentralized large-batch momentum SGD that removes the momentum-incurred bias.
arXiv Detail & Related papers (2021-04-24T16:21:01Z) - Empirically explaining SGD from a line search perspective [21.35522589789314]
We show that the full-batch loss along lines in the update step direction is highly parabolic; a toy sketch of this measurement appears below, after the list.
We also show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss.
arXiv Detail & Related papers (2021-03-31T14:54:22Z) - Double Momentum SGD for Federated Learning [94.58442574293021]
We propose a new SGD variant named DOMO to improve model performance in federated learning.
One momentum buffer tracks the server update direction, while the other tracks the local update direction.
We introduce a novel server momentum fusion technique to coordinate the server and local momentum SGD.
arXiv Detail & Related papers (2021-02-08T02:47:24Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large-eigenvalue directions of the data matrix, while GD converges along the small-eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on non-convex objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.
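As a companion to the line-search entry above ("Empirically explaining SGD from a line search perspective"), the toy sketch below samples the full-batch loss along a single negative-gradient direction and fits a parabola to the sampled values. The model, data, and step-size grid are illustrative assumptions; this only measures how parabolic the loss looks along one line and is not code from any of the listed papers.
```python
# Toy sketch: how parabolic is the full-batch loss along one update direction?
# Model, data, and step sizes are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 20), torch.randn(512, 1)  # synthetic regression data
model = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# One full-batch gradient defines the line: theta(t) = theta_0 + t * (-grad).
loss = loss_fn(model(X), y)
grads = torch.autograd.grad(loss, model.parameters())
direction = [-g for g in grads]
params0 = [p.detach().clone() for p in model.parameters()]


def loss_at(t):
    """Full-batch loss at theta_0 + t * direction, evaluated without gradients."""
    with torch.no_grad():
        for p, p0, d in zip(model.parameters(), params0, direction):
            p.copy_(p0 + float(t) * d)
        return loss_fn(model(X), y).item()


ts = np.linspace(0.0, 0.5, 21)            # step sizes sampled along the line
vals = np.array([loss_at(t) for t in ts])

coeffs = np.polyfit(ts, vals, deg=2)      # least-squares parabola fit
ss_res = float(np.sum((vals - np.polyval(coeffs, ts)) ** 2))
ss_tot = float(np.sum((vals - vals.mean()) ** 2))
print(f"R^2 of the parabola fit along the update direction: {1 - ss_res / ss_tot:.4f}")
```
An R^2 close to 1 on such a run would mirror the "highly parabolic" observation; changing the model or the step-size range shows how far that observation extends in this toy setting.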