Momentum Improves Normalized SGD
- URL: http://arxiv.org/abs/2002.03305v2
- Date: Sun, 17 May 2020 03:05:40 GMT
- Title: Momentum Improves Normalized SGD
- Authors: Ashok Cutkosky and Harsh Mehta
- Abstract summary: We show that adding momentum provably removes the need for large batch sizes on non-convex objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
- Score: 51.27183254738711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide an improved analysis of normalized SGD showing that adding
momentum provably removes the need for large batch sizes on non-convex
objectives. Then, we consider the case of objectives with bounded second
derivative and show that in this case a small tweak to the momentum formula
allows normalized SGD with momentum to find an $\epsilon$-critical point in
$O(1/\epsilon^{3.5})$ iterations, matching the best-known rates without
accruing any logarithmic factors or dependence on dimension. We also provide an
adaptive method that automatically improves convergence rates when the variance
in the gradients is small. Finally, we show that our method is effective when
employed on popular large scale tasks such as ResNet-50 and BERT pretraining,
matching the performance of the disparate methods used to get state-of-the-art
results on both tasks.
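The core update analyzed in the abstract — an exponential moving average of stochastic gradients followed by a step of fixed length in the averaged direction — can be sketched as follows. This is a minimal illustration on a toy quadratic; the function and parameter names (`grad_fn`, `lr`, `beta`, `steps`) are ours, and the sketch omits the tweaked momentum formula and the adaptive variant the abstract mentions.

```python
import numpy as np

def normalized_sgd_momentum(grad_fn, x0, lr=0.1, beta=0.9, steps=200):
    """Sketch of normalized SGD with momentum: average stochastic gradients
    with an exponential moving average, then step a fixed distance `lr`
    in the direction of that average."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = np.asarray(grad_fn(x), dtype=float)
        m = beta * m + (1.0 - beta) * g      # momentum: EMA of gradients
        norm = np.linalg.norm(m)
        if norm > 0.0:
            x = x - lr * m / norm            # normalized step: direction only
    return x

# Toy example: minimize f(x) = ||x||^2 / 2, whose gradient is x.
x_star = normalized_sgd_momentum(lambda x: x, x0=[3.0, -4.0])
```

Because the step length is fixed at `lr` regardless of the gradient magnitude, the iterates approach the minimizer and then hover near it; in practice one would decay `lr`.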
Related papers
- Demystifying SGD with Doubly Stochastic Gradients [13.033133586372612]
We establish the convergence properties of doubly SGD with independent minibatching and random reshuffling under general conditions.
We prove that random reshuffling improves the complexity dependence on the subsampling noise.
arXiv Detail & Related papers (2024-06-03T01:13:19Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Efficiency Ordering of Stochastic Gradient Descent [9.634481296779057]
We consider the stochastic gradient descent (SGD) algorithm driven by a general sampling sequence, including i.i.d. noise and random walk on an arbitrary graph.
We employ the notion of 'efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo samplers.
arXiv Detail & Related papers (2022-09-15T16:50:55Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
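As an illustration of iterate averaging for a bilinear game, here is a sketch of the deterministic extragradient method with uniform averaging on $\min_x \max_y x^\top A y$. It shows the extrapolation step and where the averaged iterates enter; it is not the stochastic or restarted scheme analyzed in the paper, and all names are ours.

```python
import numpy as np

def extragradient_avg(A, x0, y0, lr=0.2, steps=2000):
    """Sketch of extragradient with uniform iterate averaging for the
    bilinear game min_x max_y x^T A y (Nash equilibrium at the origin)."""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    xs, ys = [], []
    for _ in range(steps):
        # extrapolation (lookahead) half-step
        x_half = x - lr * (A @ y)
        y_half = y + lr * (A.T @ x)
        # real update, evaluated at the lookahead point
        x = x - lr * (A @ y_half)
        y = y + lr * (A.T @ x_half)
        xs.append(x)
        ys.append(y)
    return np.mean(xs, axis=0), np.mean(ys, axis=0)

x_bar, y_bar = extragradient_avg(np.eye(2), [1.0, 0.0], [0.0, 1.0])
```

Plain simultaneous gradient descent-ascent diverges on this game; the lookahead step makes the iterates spiral inward, and averaging smooths the remaining rotation.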
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- Second-order step-size tuning of SGD for non-convex optimization [6.021787236982659]
In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case.
One obtains a new first-order gradient method (Step-Tuned SGD) which can be seen as a version of the classical Barzilai-Borwein method.
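The Barzilai-Borwein idea referenced here can be sketched in its classical deterministic form: the step size is derived from the last displacement and the last change in gradient, a secant approximation of curvature. The sketch below is the generic BB1 rule on a toy quadratic, not the Step-Tuned SGD algorithm itself; the names are ours.

```python
import numpy as np

def bb_gradient_descent(grad_fn, x0, lr0=0.1, steps=50):
    """Gradient descent with the Barzilai-Borwein (BB1) step size
    lr_t = <s, s> / <s, y>, where s is the last displacement and
    y is the last change in gradient (a secant curvature estimate)."""
    x = np.asarray(x0, dtype=float)
    g = np.asarray(grad_fn(x), dtype=float)
    x_new = x - lr0 * g                      # first step uses a fixed guess
    for _ in range(steps):
        g_new = np.asarray(grad_fn(x_new), dtype=float)
        s, y = x_new - x, g_new - g
        denom = float(s @ y)
        lr = float(s @ s) / denom if denom > 1e-12 else lr0
        x, g = x_new, g_new
        x_new = x - lr * g
    return x_new

# Ill-conditioned quadratic f(x) = 0.5 * (x1^2 + 10 * x2^2); gradient is (x1, 10*x2).
x_min = bb_gradient_descent(lambda x: np.array([1.0, 10.0]) * x, x0=[5.0, 5.0])
```

On quadratics the BB step adapts to the local curvature without any line search, which is the intuition behind using second-order information only to tune step sizes.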
arXiv Detail & Related papers (2021-03-05T10:01:48Z)
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods [12.93796690939018]
We prove that the adaptive Polyak's Heavy-ball (HB) method attains an optimal individual convergence rate of $O(\frac{1}{\sqrt{t}})$.
Our new analysis shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization.
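The heavy-ball update itself takes two lines. The sketch below uses a fixed momentum weight on a toy quadratic, whereas the paper's analysis concerns adaptive, time-varying weights; the names are ours.

```python
import numpy as np

def heavy_ball(grad_fn, x0, lr=0.05, beta=0.9, steps=300):
    """Polyak's heavy-ball method:
    x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(steps):
        g = np.asarray(grad_fn(x), dtype=float)
        # gradient step plus momentum term from the previous displacement
        x, x_prev = x - lr * g + beta * (x - x_prev), x
    return x

# Toy example: minimize f(x) = ||x||^2 / 2, whose gradient is x.
x_hb = heavy_ball(lambda x: x, x0=[2.0, -1.0])
```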
arXiv Detail & Related papers (2021-02-15T02:57:14Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning.
In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best known sample complexity of $O(\epsilon^{-3})$ without any large batches.
arXiv Detail & Related papers (2020-07-13T20:44:15Z)
- Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first work to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.