Escaping Saddle Points Faster with Stochastic Momentum
- URL: http://arxiv.org/abs/2106.02985v1
- Date: Sat, 5 Jun 2021 23:34:02 GMT
- Title: Escaping Saddle Points Faster with Stochastic Momentum
- Authors: Jun-Kun Wang and Chi-Heng Lin and Jacob Abernethy
- Abstract summary: In training deep networks, stochastic momentum appears to significantly improve convergence time.
We show that momentum improves deep network training because it modifies SGD to escape saddle points faster.
We also show how to choose the ideal momentum parameter.
- Score: 9.485782209646445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) with stochastic momentum is popular in
nonconvex stochastic optimization and particularly for the training of deep
neural networks. In standard SGD, parameters are updated by improving along the
path of the gradient at the current iterate on a batch of examples, where the
addition of a "momentum" term biases the update in the direction of the
previous change in parameters. In non-stochastic convex optimization one can
show that a momentum adjustment provably reduces convergence time in many
settings, yet such results have been elusive in the stochastic and non-convex
settings. At the same time, a widely-observed empirical phenomenon is that in
training deep networks stochastic momentum appears to significantly improve
convergence time, and variants of it have flourished in the development of other
popular update methods, e.g. ADAM [KB15], AMSGrad [RKK18], etc. Yet theoretical
justification for the use of stochastic momentum has remained a significant
open question. In this paper we propose an answer: stochastic momentum improves
deep network training because it modifies SGD to escape saddle points faster
and, consequently, to more quickly find a second order stationary point. Our
theoretical results also shed light on the related question of how to choose
the ideal momentum parameter--our analysis suggests that $\beta \in [0,1)$
should be large (close to 1), which comports with empirical findings. We also
provide experimental findings that further validate these conclusions.
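
As a concrete illustration of the update described above, the following is a minimal NumPy sketch of SGD with stochastic (heavy-ball) momentum on a toy least-squares problem. The problem, batch size, and hyperparameter values are illustrative placeholders rather than the paper's experimental setup; the momentum parameter beta is set close to 1, as the analysis suggests.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy least-squares problem standing in for a training loss (illustrative only).
    A = rng.normal(size=(1000, 20))
    x_true = rng.normal(size=20)
    y = A @ x_true + 0.1 * rng.normal(size=1000)

    def minibatch_grad(theta, batch_size=32):
        """Stochastic gradient of 0.5 * ||A theta - y||^2 / B on a random minibatch."""
        idx = rng.choice(len(y), size=batch_size, replace=False)
        Ab, yb = A[idx], y[idx]
        return Ab.T @ (Ab @ theta - yb) / batch_size

    def sgd_momentum(steps=2000, eta=0.01, beta=0.9):
        """SGD with stochastic (heavy-ball) momentum:
        m_{t+1} = beta * m_t + g_t,   theta_{t+1} = theta_t - eta * m_{t+1}."""
        theta = np.zeros(A.shape[1])
        m = np.zeros_like(theta)
        for _ in range(steps):
            g = minibatch_grad(theta)
            m = beta * m + g           # momentum buffer
            theta = theta - eta * m    # parameter update
        return theta

    theta_hat = sgd_momentum(beta=0.9)  # the analysis above favors beta close to 1
    print(np.linalg.norm(theta_hat - x_true))

In this heavy-ball form, theta_{t+1} - theta_t = -eta * g_t + beta * (theta_t - theta_{t-1}), which matches the description of momentum as a bias toward the previous change in parameters.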
Related papers
- Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation [51.248784084461334]
We prove new convergence rates for a generalized version of Nesterov acceleration under interpolation conditions.
Our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ as compared to prior work.
arXiv Detail & Related papers (2024-04-03T00:41:19Z) - Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks [0.6906005491572401]
We show that the noise in stochastic gradient descent (SGD) with momentum smoothes the objective function, with the degree of smoothing determined by the learning rate, the batch size, the momentum factor, and the upper bound of the gradient norm.
We also provide experimental results supporting our assertion that model generalizability depends on the noise level.
arXiv Detail & Related papers (2024-02-04T02:48:28Z) - The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching (a generic sketch of these two ingredients appears after the list below).
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Last-iterate convergence analysis of stochastic momentum methods for
neural networks [3.57214198937538]
The momentum method is used to solve large-scale optimization problems in neural networks.
Current convergence results for stochastic momentum methods are mostly obtained under artificial settings.
The momentum factors can be fixed to be constant, rather than the time-varying coefficients used in existing analyses.
arXiv Detail & Related papers (2022-05-30T02:17:44Z) - Convergence and Stability of the Stochastic Proximal Point Algorithm
with Momentum [14.158845925610438]
We show how the stochastic proximal point algorithm with momentum (SPPAM) allows faster convergence to a neighborhood of the solution than the stochastic proximal point algorithm (SPPA), with a better contraction factor.
arXiv Detail & Related papers (2021-11-11T12:17:22Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of its current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - A High Probability Analysis of Adaptive SGD with Momentum [22.9530287983179]
Stochastic Gradient Descent (SGD) and its variants are the most used algorithms in machine learning applications.
We show for the first time a high-probability analysis of the convergence of the gradients to zero in the smooth nonconvex setting for Delayed AdaGrad with momentum.
arXiv Detail & Related papers (2020-07-28T15:06:22Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
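
For the Langevin-dynamics entry above, only the name of the method is given here, so the following is a minimal sketch, assuming a generic stochastic gradient Langevin dynamics (SGLD) step combined with epoch-shuffled (without-replacement) minibatching on a toy quadratic loss. It illustrates the two ingredients mentioned in that summary, not the paper's proposed variant; the step size and temperature are hypothetical placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy quadratic objective standing in for a training loss (illustrative only).
    A = rng.normal(size=(512, 10))
    y = A @ rng.normal(size=10)

    def sgld_epoch(theta, eta=1e-3, temperature=1e-3, batch_size=64):
        """One epoch of SGLD with without-replacement minibatching: the data are
        shuffled once, each example is visited exactly once, and every step adds
        Gaussian noise with variance 2 * eta * temperature."""
        perm = rng.permutation(len(y))                   # without-replacement sampling
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]
            Ab, yb = A[idx], y[idx]
            grad = Ab.T @ (Ab @ theta - yb) / len(idx)   # minibatch gradient
            noise = rng.normal(size=theta.shape)
            theta = theta - eta * grad + np.sqrt(2.0 * eta * temperature) * noise
        return theta

    theta = np.zeros(A.shape[1])
    for _ in range(50):
        theta = sgld_epoch(theta)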