Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate
and Momentum
- URL: http://arxiv.org/abs/2006.15815v11
- Date: Sun, 7 Feb 2021 11:53:48 GMT
- Title: Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate
and Momentum
- Authors: Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, Masashi Sugiyama
- Abstract summary: We disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
- Score: 97.84312669132716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and
Momentum, is arguably the most popular stochastic optimizer for accelerating the
training of deep neural networks. However, it is empirically known that Adam
often generalizes worse than Stochastic Gradient Descent (SGD). The purpose of
this paper is to unveil the mystery of this behavior in the diffusion
theoretical framework. Specifically, we disentangle the effects of Adaptive
Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and
flat minima selection. We prove that Adaptive Learning Rate can escape saddle
points efficiently, but cannot select flat minima as SGD does. In contrast,
Momentum provides a drift effect to help the training process pass through
saddle points, and has almost no effect on flat minima selection. This partly
explains why SGD (with Momentum) generalizes better, while Adam generalizes
worse but converges faster. Furthermore, motivated by the analysis, we design a
novel adaptive optimization framework named Adaptive Inertia, which uses
parameter-wise adaptive inertia to accelerate training and provably favors
flat minima as well as SGD does. Our extensive experiments demonstrate that the
proposed adaptive inertia method can generalize significantly better than SGD
and conventional adaptive gradient methods.
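To make the framework above concrete, here is a minimal NumPy sketch of the parameter-wise adaptive-inertia idea: the momentum (inertia) coefficient is adapted per coordinate from a second-moment estimate of the gradients, while the learning rate stays non-adaptive as in SGD. This is an illustrative reading of the abstract rather than the paper's exact Adai update; the hyperparameter `beta0`, the clipping bounds, and the omission of bias corrections are assumptions.
```python
import numpy as np

def adaptive_inertia_step(theta, grad, m, v, lr=1e-3, beta0=0.1,
                          beta2=0.99, eps=1e-3):
    """One illustrative parameter-wise adaptive-inertia update.

    The momentum coefficient beta1 is adapted per coordinate from a
    second-moment estimate of the gradients, while the step size `lr`
    is the same for every coordinate (no adaptive learning rate).
    """
    # Running estimate of the squared gradients (second moment).
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # Coordinates with above-average gradient variance get less inertia;
    # coordinates with below-average variance get more.
    beta1 = np.clip(1.0 - beta0 * v / (v.mean() + 1e-12), 0.0, 1.0 - eps)

    # Heavy-ball style momentum with the parameter-wise coefficient.
    m = beta1 * m + (1.0 - beta1) * grad

    # Plain, non-adaptive learning rate, as in SGD.
    theta = theta - lr * m
    return theta, m, v
```
Starting from `m` and `v` initialized to zero arrays, calling this once per minibatch gives SGD-like steps whose inertia, rather than step size, varies across parameters.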
Related papers
- Understanding Optimization in Deep Learning with Central Flows [53.66160508990508]
We show that an optimizer's implicit behavior can be explicitly captured by a "central flow": a differential equation.
We show that these flows can empirically predict long-term optimization trajectories of generic neural networks.
arXiv Detail & Related papers (2024-10-31T17:58:13Z)
- The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization [4.7256945641654164]
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training.
Recent studies of SGD for canonical quadratic optimization or linear regression show that it attains good generalization in suitable high-dimensional settings.
This paper investigates SGD with two components that are essential in practice: an exponentially decaying step size schedule and momentum (a brief sketch of this setup follows this entry).
arXiv Detail & Related papers (2024-09-15T14:20:03Z)
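As a brief illustration of the setup studied in the entry above, here is a hedged NumPy sketch of heavy-ball SGD with an exponentially decaying step size on a random quadratic; the dimensions, decay rate, momentum value, and noise scale are arbitrary choices for illustration, not the paper's settings.
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
H = A.T @ A / 20.0                 # Hessian of the quadratic f(x) = 0.5 * x^T H x
x = rng.standard_normal(20)
velocity = np.zeros_like(x)

lr0, decay, beta, noise = 0.1, 0.995, 0.9, 0.01
for t in range(1000):
    grad = H @ x + noise * rng.standard_normal(20)  # noisy (stochastic) gradient
    lr = lr0 * decay ** t                           # exponentially decaying step size
    velocity = beta * velocity + grad               # heavy-ball momentum
    x = x - lr * velocity

print("final loss:", 0.5 * x @ H @ x)
```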
- Signal Processing Meets SGD: From Momentum to Filter [6.751292200515353]
In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization.
We propose a novel optimization method designed to accelerate SGD's convergence without sacrificing generalization.
arXiv Detail & Related papers (2023-11-06T01:41:46Z)
- Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study the training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training without sacrificing generalization (see the sketch after this entry).
arXiv Detail & Related papers (2022-10-28T20:41:48Z)
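The 2/3 power-law scaling mentioned in the entry above can be written as a one-line rule; in this sketch the proportionality constant `c` is an assumed placeholder, since the summary only specifies the exponent.
```python
def momentum_from_lr(lr, c=1.0):
    """Set the momentum hyperparameter so that 1 - beta scales as lr ** (2/3).

    `c` is a placeholder proportionality constant; the summary only
    specifies the 2/3 power-law scaling with the learning rate.
    """
    one_minus_beta = min(1.0, c * lr ** (2.0 / 3.0))
    return 1.0 - one_minus_beta

print(momentum_from_lr(0.1))   # ~0.785
print(momentum_from_lr(0.05))  # ~0.864: smaller learning rate, beta closer to 1
```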
- No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models [132.90062129639705]
We propose a novel training strategy that encourages all parameters to be trained sufficiently.
A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate.
In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent overfitting (a rough sketch of this rule follows this entry).
arXiv Detail & Related papers (2022-02-06T00:22:28Z)
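A rough sketch of the sensitivity-guided idea summarized above, under the assumption that sensitivity is approximated by the first-order term |theta * grad| and that the learning rate is scaled inversely to it; both choices are illustrative stand-ins, not the paper's exact criterion.
```python
import numpy as np

def sensitivity_guided_lrs(theta, grad, base_lr=1e-3, eps=1e-8):
    """Illustrative per-parameter learning rates guided by sensitivity.

    Sensitivity is approximated by |theta * grad|, a first-order estimate
    of how much the loss would change if the parameter were removed.
    Low-sensitivity parameters get a larger step; high-sensitivity
    parameters get a smaller one.
    """
    sensitivity = np.abs(theta * grad)
    # Normalize so the average learning rate stays near base_lr,
    # then bound the per-parameter scaling (bounds are arbitrary here).
    scale = sensitivity.mean() / (sensitivity + eps)
    return base_lr * np.clip(scale, 0.1, 10.0)
```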
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of training speed and test error (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
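An illustrative sketch in the spirit of the AdaRem description above: a running average of past updates records the direction each parameter has been moving, and the step is enlarged where the current descent direction agrees with that history and damped where it opposes it. The specific scaling factor is an assumption for illustration only.
```python
import numpy as np

def adarem_like_step(theta, grad, avg_update, lr=1e-3, beta=0.9):
    """Sketch of a direction-alignment rule in the spirit of the AdaRem summary.

    A running average of past updates records the direction each parameter
    has been moving in. The step is enlarged where the current descent
    direction (-grad) agrees with that history and damped where it opposes
    it. The +/- 50% scaling is an arbitrary illustrative choice.
    """
    agreement = np.sign(avg_update) * np.sign(-grad)  # +1 aligned, -1 opposed, 0 neutral
    scale = 1.0 + 0.5 * agreement
    update = -lr * scale * grad
    theta = theta + update
    avg_update = beta * avg_update + (1.0 - beta) * update
    return theta, avg_update
```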
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.