Understanding and Detecting Convergence for Stochastic Gradient Descent
with Momentum
- URL: http://arxiv.org/abs/2008.12224v1
- Date: Thu, 27 Aug 2020 16:24:18 GMT
- Title: Understanding and Detecting Convergence for Stochastic Gradient Descent
with Momentum
- Authors: Jerry Chee and Ping Li
- Abstract summary: This paper considers stochastic gradient descent (SGD) with a constant learning rate and momentum.
We construct a statistical diagnostic test for convergence to the stationary phase using the inner product between successive gradients.
- Score: 18.88380216480131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convergence detection of iterative stochastic optimization methods is of
great practical interest. This paper considers stochastic gradient descent
(SGD) with a constant learning rate and momentum. We show that there exists a
transient phase in which iterates move towards a region of interest, and a
stationary phase in which iterates remain bounded in that region around a
minimum point. We construct a statistical diagnostic test for convergence to
the stationary phase using the inner product between successive gradients and
demonstrate that the proposed diagnostic works well. We theoretically and
empirically characterize how momentum can affect the test statistic of the
diagnostic, and how the test statistic captures a relatively sparse signal
within the gradients in convergence. Finally, we demonstrate an application to
automatically tune the learning rate by reducing it each time stationarity is
detected, and show the procedure is robust to mis-specified initial rates.
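As a rough illustration of the idea (not the paper's exact procedure), the sketch below runs SGD with momentum on a noisy quadratic, accumulates the inner product between successive stochastic gradients as a test statistic, and halves the learning rate whenever the statistic turns negative after a burn-in. The objective, noise model, burn-in length, and halving factor are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Stochastic gradient of f(x) = 0.5 * ||x||^2 (illustrative objective).
    return x + rng.standard_normal(x.shape)

x = 2.0 * np.ones(10)          # start away from the minimum at 0
v = np.zeros_like(x)           # momentum buffer
lr, beta = 0.1, 0.5
stat, prev_g = 0.0, None       # running sum of <g_t, g_{t-1}>
steps_since_reset, lr_drops = 0, 0

for t in range(20000):
    g = noisy_grad(x)
    if prev_g is not None:
        # In the transient phase successive gradients are positively
        # correlated; in the stationary phase the statistic drifts downward.
        stat += g @ prev_g
    prev_g = g
    v = beta * v + g
    x = x - lr * v
    steps_since_reset += 1
    if steps_since_reset > 100 and stat < 0:
        lr *= 0.5              # stationarity detected: decay the learning rate
        lr_drops += 1
        stat, prev_g, steps_since_reset = 0.0, None, 0

print(f"learning-rate halvings: {lr_drops}, final ||x||: {np.linalg.norm(x):.3f}")
```

With heavier momentum (beta closer to 1) successive iterates, and hence successive gradients, become more positively correlated, which weakens the negative drift of the statistic at stationarity; this is the kind of effect on the test statistic the paper characterizes.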
Related papers
- One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual stochastic gradient descent with averaging or the adaptive gradient descent.
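A hedged, simplified sketch of the one-step correction idea (not the paper's estimator): on a one-dimensional exponential model, take a rough estimate produced by a few projected gradient ascent steps on the log-likelihood, then apply a single Fisher scoring step. The model, step size, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=0.5, size=2000)   # Exp(rate=2) samples

def score(lam):
    # Derivative of the log-likelihood n*log(lam) - lam*sum(x).
    return x.size / lam - x.sum()

def fisher_info(lam):
    # Expected Fisher information for n i.i.d. Exp(lam) samples.
    return x.size / lam**2

# Rough preliminary estimate from a few projected gradient ascent steps.
lam = 1.0
for _ in range(5):
    lam = max(lam + 1e-4 * score(lam), 1e-8)   # project onto lam > 0

# Single Fisher scoring correction step.
lam_corrected = lam + score(lam) / fisher_info(lam)

mle = 1.0 / x.mean()   # closed-form MLE for comparison
print(f"rough: {lam:.3f}, corrected: {lam_corrected:.3f}, MLE: {mle:.3f}")
```

One Fisher scoring step contracts the error quadratically toward the MLE here (lam_corrected = mle - mean(x) * (lam - mle)**2), which is why a single correction of a crude estimate can already be a competitive estimator.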
arXiv Detail & Related papers (2023-06-09T13:43:07Z) - Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z) - On Almost Sure Convergence Rates of Stochastic Gradient Methods [11.367487348673793]
We show for the first time that the almost sure convergence rates obtained for stochastic gradient methods are arbitrarily close to their optimal possible convergence rates.
For non-convex objective functions, we show that a weighted average of the squared gradient norms converges to zero not only in expectation but also almost surely.
arXiv Detail & Related papers (2022-02-09T06:05:30Z) - A Continuous-time Stochastic Gradient Descent Method for Continuous Data [0.0]
We study a continuous-time variant of the stochastic gradient descent algorithm for optimization problems with continuous data.
We study multiple sampling patterns for the continuous data space and allow for data simulated or streamed at runtime.
We conclude by illustrating the applicability of the stochastic gradient process in a regression problem with noisy functional data, as well as in a physics-informed neural network.
arXiv Detail & Related papers (2021-12-07T15:09:24Z) - Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic
Gradient Descent using Stochastic Learning Rates [0.0]
This paper establishes accelerated almost-sure convergence rates for stochastic gradient descent on large-scale nonconvex optimization problems.
In particular, the learning rate is equipped with multiplicative stochasticity.
arXiv Detail & Related papers (2021-10-25T04:27:35Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, along with variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
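As an illustrative toy instance (step size, noise level, and horizon chosen by assumption, and with no restarting schedule), the following runs stochastic extragradient on the scalar bilinear game min over x, max over y of x*y, whose Nash equilibrium is (0, 0), and averages the iterates:

```python
import numpy as np

rng = np.random.default_rng(2)

eta, sigma, T = 0.2, 0.1, 20000
x, y = 1.0, 1.0                 # start away from the Nash equilibrium (0, 0)
x_sum, y_sum = 0.0, 0.0

for t in range(T):
    # Extrapolation step with noisy gradients of f(x, y) = x * y.
    xh = x - eta * (y + sigma * rng.standard_normal())
    yh = y + eta * (x + sigma * rng.standard_normal())
    # Update step uses gradients evaluated at the extrapolated point.
    x = x - eta * (yh + sigma * rng.standard_normal())
    y = y + eta * (xh + sigma * rng.standard_normal())
    x_sum += x
    y_sum += y

x_avg, y_avg = x_sum / T, y_sum / T
print(f"averaged iterate: ({x_avg:.4f}, {y_avg:.4f})")
```

Without the extrapolation step, simultaneous gradient descent-ascent on this game spirals outward; with it, the iterates contract toward the equilibrium, and averaging suppresses the residual noise around it.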
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - On Convergence-Diagnostic based Step Sizes for Stochastic Gradient
Descent [24.042107117994046]
Constant step-size Stochastic Gradient Descent exhibits two phases: a transient phase during which iterates make fast progress towards the optimum, followed by a stationary phase during which iterates oscillate around the optimal point.
We show that efficiently detecting this transition and appropriately decreasing the step size can lead to fast convergence rates.
arXiv Detail & Related papers (2020-07-01T14:58:01Z) - Analysis of Stochastic Gradient Descent in Continuous Time [0.0]
We introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent.
We show that it converges weakly to the gradient flow as the learning rate approaches zero.
In this case, the process converges weakly to the point mass concentrated in the global minimum of the full target function.
arXiv Detail & Related papers (2020-04-15T16:04:41Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-convex functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.