Understanding and Detecting Convergence for Stochastic Gradient Descent
with Momentum
- URL: http://arxiv.org/abs/2008.12224v1
- Date: Thu, 27 Aug 2020 16:24:18 GMT
- Title: Understanding and Detecting Convergence for Stochastic Gradient Descent
with Momentum
- Authors: Jerry Chee and Ping Li
- Abstract summary: This paper considers stochastic gradient descent (SGD) with a constant learning rate and momentum.
We construct a statistical diagnostic test for convergence to the stationary phase using the inner product between successive gradients.
- Score: 18.88380216480131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convergence detection of iterative stochastic optimization methods is of
great practical interest. This paper considers stochastic gradient descent
(SGD) with a constant learning rate and momentum. We show that there exists a
transient phase in which iterates move towards a region of interest, and a
stationary phase in which iterates remain bounded in that region around a
minimum point. We construct a statistical diagnostic test for convergence to
the stationary phase using the inner product between successive gradients and
demonstrate that the proposed diagnostic works well. We theoretically and
empirically characterize how momentum can affect the test statistic of the
diagnostic, and how the test statistic captures a relatively sparse signal
within the gradients in convergence. Finally, we demonstrate an application to
automatically tune the learning rate by reducing it each time stationarity is
detected, and show the procedure is robust to mis-specified initial rates.
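As a rough illustration of the idea (not the paper's exact procedure), the sketch below runs SGD with momentum on a noisy quadratic, accumulates the inner product between successive stochastic gradients as a test statistic, and halves the learning rate whenever the statistic turns negative after a burn-in. The objective, noise model, burn-in length, and halving factor are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Stochastic gradient of f(x) = 0.5 * ||x||^2 (illustrative objective).
    return x + rng.standard_normal(x.shape)

x = 2.0 * np.ones(10)          # start away from the minimum at 0
v = np.zeros_like(x)           # momentum buffer
lr, beta = 0.1, 0.5
stat, prev_g = 0.0, None       # running sum of <g_t, g_{t-1}>
steps_since_reset, lr_drops = 0, 0

for t in range(20000):
    g = noisy_grad(x)
    if prev_g is not None:
        # In the transient phase successive gradients are positively
        # correlated; in the stationary phase the statistic drifts downward.
        stat += g @ prev_g
    prev_g = g
    v = beta * v + g
    x = x - lr * v
    steps_since_reset += 1
    if steps_since_reset > 100 and stat < 0:
        lr *= 0.5              # stationarity detected: decay the learning rate
        lr_drops += 1
        stat, prev_g, steps_since_reset = 0.0, None, 0

print(f"learning-rate halvings: {lr_drops}, final ||x||: {np.linalg.norm(x):.3f}")
```

With heavier momentum (beta closer to 1) successive iterates, and hence successive gradients, become more positively correlated, which weakens the negative drift of the statistic at stationarity; this is the kind of effect on the test statistic the paper characterizes.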
Related papers
- One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual stochastic gradient descent with averaging or the adaptive gradient descent.
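A hedged, simplified sketch of the one-step correction idea (not the paper's estimator): on a one-dimensional exponential model, take a rough estimate produced by a few projected gradient ascent steps on the log-likelihood, then apply a single Fisher scoring step. The model, step size, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=0.5, size=2000)   # Exp(rate=2) samples

def score(lam):
    # Derivative of the log-likelihood n*log(lam) - lam*sum(x).
    return x.size / lam - x.sum()

def fisher_info(lam):
    # Expected Fisher information for n i.i.d. Exp(lam) samples.
    return x.size / lam**2

# Rough preliminary estimate from a few projected gradient ascent steps.
lam = 1.0
for _ in range(5):
    lam = max(lam + 1e-4 * score(lam), 1e-8)   # project onto lam > 0

# Single Fisher scoring correction step.
lam_corrected = lam + score(lam) / fisher_info(lam)

mle = 1.0 / x.mean()   # closed-form MLE for comparison
print(f"rough: {lam:.3f}, corrected: {lam_corrected:.3f}, MLE: {mle:.3f}")
```

One Fisher scoring step contracts the error quadratically toward the MLE here (lam_corrected = mle - mean(x) * (lam - mle)**2), which is why a single correction of a crude estimate can already be a competitive estimator.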
arXiv Detail & Related papers (2023-06-09T13:43:07Z) - Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z) - On Almost Sure Convergence Rates of Stochastic Gradient Methods [11.367487348673793]
We show for the first time that the almost sure convergence rates obtained for stochastic gradient methods are arbitrarily close to their optimal possible convergence rates.
For non-convex objective functions, we show that a weighted average of the squared gradient norms converges to zero not only in expectation but also almost surely.
arXiv Detail & Related papers (2022-02-09T06:05:30Z) - A Continuous-time Stochastic Gradient Descent Method for Continuous Data [0.0]
We study a continuous-time variant of the stochastic gradient descent algorithm for optimization problems with continuous data.
We study multiple sampling patterns for the continuous data space and allow for data simulated or streamed at runtime.
We conclude by illustrating the applicability of the stochastic gradient process in a regression problem with noisy functional data, as well as in a physics-informed neural network.
arXiv Detail & Related papers (2021-12-07T15:09:24Z) - Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic
Gradient Descent using Stochastic Learning Rates [0.0]
This paper establishes accelerated almost-sure convergence rates for stochastic gradient descent on large-scale nonconvex optimization problems.
In particular, the learning rate is equipped with multiplicative stochasticity.
arXiv Detail & Related papers (2021-10-25T04:27:35Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, along with variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
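As an illustrative toy instance (step size, noise level, and horizon chosen by assumption, and with no restarting schedule), the following runs stochastic extragradient on the scalar bilinear game min over x, max over y of x*y, whose Nash equilibrium is (0, 0), and averages the iterates:

```python
import numpy as np

rng = np.random.default_rng(2)

eta, sigma, T = 0.2, 0.1, 20000
x, y = 1.0, 1.0                 # start away from the Nash equilibrium (0, 0)
x_sum, y_sum = 0.0, 0.0

for t in range(T):
    # Extrapolation step with noisy gradients of f(x, y) = x * y.
    xh = x - eta * (y + sigma * rng.standard_normal())
    yh = y + eta * (x + sigma * rng.standard_normal())
    # Update step uses gradients evaluated at the extrapolated point.
    x = x - eta * (yh + sigma * rng.standard_normal())
    y = y + eta * (xh + sigma * rng.standard_normal())
    x_sum += x
    y_sum += y

x_avg, y_avg = x_sum / T, y_sum / T
print(f"averaged iterate: ({x_avg:.4f}, {y_avg:.4f})")
```

Without the extrapolation step, simultaneous gradient descent-ascent on this game spirals outward; with it, the iterates contract toward the equilibrium, and averaging suppresses the residual noise around it.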
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - On Convergence-Diagnostic based Step Sizes for Stochastic Gradient
Descent [24.042107117994046]
Constant step-size Stochastic Gradient Descent exhibits two phases: a transient phase during which iterates make fast progress towards the optimum, followed by a stationary phase during which iterates oscillate around the optimal point.
We show that efficiently detecting this transition and appropriately decreasing the step size can lead to fast convergence rates.
arXiv Detail & Related papers (2020-07-01T14:58:01Z) - Analysis of Stochastic Gradient Descent in Continuous Time [0.0]
We introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent.
We show that it converges weakly to the gradient flow as the learning rate approaches zero.
In this case, the process converges weakly to the point mass concentrated in the global minimum of the full target function.
arXiv Detail & Related papers (2020-04-15T16:04:41Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-convex functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.