On the Hyperparameters in Stochastic Gradient Descent with Momentum
- URL: http://arxiv.org/abs/2108.03947v1
- Date: Mon, 9 Aug 2021 11:25:03 GMT
- Title: On the Hyperparameters in Stochastic Gradient Descent with Momentum
- Authors: Bin Shi
- Abstract summary: We present the theoretical analysis for stochastic gradient descent with momentum (SGD with momentum) in this paper.
By comparison, we show how the optimal linear rate of convergence and the final gap, which for standard SGD depend only on the learning rate, vary with the momentum coefficient increasing from zero to one, using a continuous surrogate SDE.
Finally, we show the Nesterov momentum under the existence of noise has no essential difference from the standard momentum.
- Score: 6.396288020763144
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Following the same routine as [SSJ20], we continue to present the theoretical
analysis for stochastic gradient descent with momentum (SGD with momentum) in
this paper. In contrast, for SGD with momentum, we demonstrate that it is the two
hyperparameters together, the learning rate and the momentum coefficient, that
play the significant role in the linear rate of convergence in non-convex
optimization. Our analysis is based on the use of a hyperparameters-dependent
stochastic differential equation (hp-dependent SDE) that serves as a continuous
surrogate for SGD with momentum. Similarly, we establish the linear convergence
for the continuous-time formulation of SGD with momentum and obtain an explicit
expression for the optimal linear rate by analyzing the spectrum of the
Kramers-Fokker-Planck operator. By comparison, we demonstrate how the optimal
linear rate of convergence and the final gap, which for standard SGD depend only
on the learning rate, vary with the momentum coefficient increasing from zero to
one once momentum is introduced. Then, we propose a mathematical interpretation
of why SGD with momentum converges faster and is more robust with respect to the
learning rate than the standard SGD in practice. Finally, we show that the
Nesterov momentum under the existence of noise has no essential difference from
the standard momentum.
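The two hyperparameters the abstract highlights, the learning rate and the momentum coefficient, enter the discrete update jointly. A minimal sketch of the update rule (the toy quadratic objective, noise scale, and parameter values below are illustrative assumptions, not taken from the paper):

```python
import random

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: v accumulates an exponentially
    weighted sum of gradients; (lr, beta) together set the effective
    step size and the damping of the dynamics."""
    v = beta * v + grad
    w = w - lr * v
    return w, v

random.seed(0)
w, v = 1.0, 0.0
for _ in range(200):
    # noisy gradient of the toy objective f(w) = w**2 / 2
    grad = w + 0.01 * random.gauss(0.0, 1.0)
    w, v = sgd_momentum_step(w, v, grad)
# w has contracted toward the minimizer w = 0, up to a noise floor
```

Note that with beta = 0 this reduces to plain SGD with step size lr, which is why the pair (lr, beta), rather than lr alone, governs the convergence rate.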
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper characterizes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
It is interestingly shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation [51.248784084461334]
We prove new convergence rates for a generalized version of stochastic Nesterov acceleration under interpolation conditions.
Our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ as compared to prior work.
arXiv Detail & Related papers (2024-04-03T00:41:19Z) - Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks [0.6906005491572401]
We show that noise in stochastic gradient descent (SGD) with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, and the upper bound of the norm.
We also provide experimental results supporting our assertion that model generalizability depends on the noise level.
arXiv Detail & Related papers (2024-02-04T02:48:28Z) - NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented.
The proximal Gibbs distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
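The scheme summarized above, extragradient with iterate averaging, can be sketched on a toy bilinear game (the scalar game f(x, y) = x * y, the step size, and the noise scale are illustrative assumptions, not taken from the paper):

```python
import random

def seg_step(x, y, lr, noise):
    """One stochastic extragradient step on the bilinear game
    f(x, y) = x * y, with x the minimizing and y the maximizing player."""
    # extrapolation: look ahead using noisy gradients at the current point
    xh = x - lr * (y + noise())
    yh = y + lr * (x + noise())
    # update: re-evaluate noisy gradients at the extrapolated point
    return x - lr * (yh + noise()), y + lr * (xh + noise())

random.seed(1)
lr, x, y = 0.1, 1.0, 1.0
sum_x = sum_y = 0.0
n = 2000
for _ in range(n):
    x, y = seg_step(x, y, lr, lambda: 0.01 * random.gauss(0.0, 1.0))
    sum_x += x
    sum_y += y
# the averaged iterates approach the Nash equilibrium (0, 0)
x_avg, y_avg = sum_x / n, sum_y / n
```

The extrapolation step is what distinguishes SEG from simultaneous gradient descent-ascent, which spirals outward on this game; averaging further damps the residual oscillation of the iterates.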
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Momentum via Primal Averaging: Theoretical Insights and Learning Rate
Schedules for Non-Convex Optimization [10.660480034605241]
Momentum methods are now used pervasively within the machine learning community for the training of non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum, by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form.
arXiv Detail & Related papers (2020-10-01T13:46:32Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-convex functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - Convergence rates and approximation results for SGD and its
continuous-time counterpart [16.70533901524849]
This paper proposes a thorough theoretical analysis of Stochastic Gradient Descent (SGD) with non-increasing step sizes.
First, we show that SGD can be provably approximated by solutions of an inhomogeneous Stochastic Differential Equation (SDE) using coupling.
Following recent analyses of deterministic and stochastic optimization methods via their continuous counterparts, we study the long-time behavior of the continuous processes at hand and establish non-asymptotic bounds.
arXiv Detail & Related papers (2020-04-08T18:31:34Z)
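The continuous-time surrogate idea recurring in several of these papers can be illustrated with an Euler-Maruyama discretization of an SDE of the form dW_t = -grad f(W_t) dt + sqrt(lr) * sigma dB_t (the quadratic objective and all constants below are illustrative assumptions, not values from any of the papers):

```python
import math
import random

def euler_maruyama(w, grad_f, lr, sigma, n_steps):
    """Simulate dW_t = -grad_f(W_t) dt + sqrt(lr) * sigma dB_t with
    step size h = lr, so each increment matches an SGD-like step."""
    h = lr
    for _ in range(n_steps):
        drift = -h * grad_f(w)
        diffusion = math.sqrt(lr) * sigma * math.sqrt(h) * random.gauss(0.0, 1.0)
        w = w + drift + diffusion
    return w

random.seed(0)
# toy objective f(w) = w**2 / 2, so grad_f(w) = w; the resulting process
# is an Ornstein-Uhlenbeck process fluctuating around the minimizer w = 0
w = euler_maruyama(2.0, lambda w: w, lr=0.01, sigma=0.5, n_steps=1000)
```

Under this scaling, the stationary fluctuation of the process around the minimizer shrinks with the learning rate, which is one way such surrogates expose the role of the hyperparameters in the final gap.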
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.