Convergence rates and approximation results for SGD and its
continuous-time counterpart
- URL: http://arxiv.org/abs/2004.04193v2
- Date: Fri, 29 Jan 2021 21:39:41 GMT
- Title: Convergence rates and approximation results for SGD and its
continuous-time counterpart
- Authors: Xavier Fontaine, Valentin De Bortoli, and Alain Durmus
- Abstract summary: This paper proposes a thorough theoretical analysis of Stochastic Gradient Descent (SGD) with non-increasing step sizes.
First, we show that the recursion defining SGD can be provably approximated by solutions of a time inhomogeneous Stochastic Differential Equation (SDE) using an appropriate coupling.
Then, motivated by recent analyses of deterministic and stochastic optimization methods by their continuous counterpart, we study the long-time behavior of the continuous processes at hand and establish non-asymptotic bounds.
- Score: 16.70533901524849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a thorough theoretical analysis of Stochastic Gradient
Descent (SGD) with non-increasing step sizes. First, we show that the recursion
defining SGD can be provably approximated by solutions of a time inhomogeneous
Stochastic Differential Equation (SDE) using an appropriate coupling. In the
specific case of a batch noise we refine our results using recent advances in
Stein's method. Then, motivated by recent analyses of deterministic and
stochastic optimization methods by their continuous counterpart, we study the
long-time behavior of the continuous processes at hand and establish
non-asymptotic bounds. To that purpose, we develop new comparison techniques
which are of independent interest. Adapting these techniques to the discrete
setting, we show that the same results hold for the corresponding SGD
sequences. In our analysis, we notably improve non-asymptotic bounds in the
convex setting for SGD under weaker assumptions than the ones considered in
previous works. Finally, we also establish finite-time convergence results
under various conditions, including relaxations of the famous Łojasiewicz
inequality, which can be applied to a class of non-convex functions.
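To make the setting of the abstract concrete, the following is a minimal sketch of SGD with non-increasing step sizes. It is not taken from the paper: the objective f(x) = x^2 / 2, the Gaussian gradient noise, the step-size schedule gamma / (n + 1)^alpha, and the function name sgd_quadratic are all assumptions chosen for illustration.

```python
import random


def sgd_quadratic(x0, gamma, alpha, n_steps, noise_std, seed=0):
    """Run SGD on the toy objective f(x) = x^2 / 2 (an illustrative assumption).

    The step size at iteration n is gamma / (n + 1)^alpha, which is
    non-increasing in n for alpha >= 0.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(n_steps):
        step = gamma / (n + 1) ** alpha
        # Unbiased noisy gradient: the true gradient of f is x.
        grad = x + rng.gauss(0.0, noise_std)
        x -= step * grad
    return x


if __name__ == "__main__":
    x_final = sgd_quadratic(x0=5.0, gamma=0.5, alpha=0.5,
                            n_steps=10_000, noise_std=1.0)
    print(f"distance to the minimizer after 10000 steps: {abs(x_final):.4f}")
```

With noise_std set to 0 the recursion reduces to plain gradient descent with decaying steps, which gives a quick sanity check of the schedule before noise is added.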
Related papers
- Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis [17.34603953600226]
Adaptive methods have emerged as powerful tools in deep learning, dynamically adjusting the learning rate based on gradients.
These methods have succeeded in various deep learning tasks; this work focuses on AdaGrad, a cornerstone adaptive method.
arXiv Detail & Related papers (2024-09-08T08:29:51Z)
- Utilising the CLT Structure in Stochastic Gradient based Sampling : Improved Analysis and Faster Algorithms [14.174806471635403]
We consider stochastic approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamics (IPD).
We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian.
We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z)
- Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for stochastic methods that solve monotone and structured non-monotone VIPs.
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
- Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis [6.497816402045099]
Two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGDRR) and shuffle-once (SGD-SO).
We study the stationary variances of SGD, SGDRR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations.
arXiv Detail & Related papers (2022-06-01T17:08:04Z)
- Differential Privacy Guarantees for Stochastic Gradient Langevin Dynamics [2.9477900773805032]
We show that the privacy loss converges exponentially fast for smooth and strongly convex objectives under constant step size.
We propose an implementation and our experiments show the practical utility of our approach compared to classical DP-SGD libraries.
arXiv Detail & Related papers (2022-01-28T08:21:31Z)
- On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
Stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade.
Some modified SGD-type algorithms, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad), outperform SGD in many competitions and applications in terms of convergence rate and accuracy.
We focus on convergence analysis of mSGD and AdaGrad for any smooth (possibly non-convex) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable averaging of past gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z)
- Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)