Convergence rates for momentum stochastic gradient descent with noise of
machine learning type
- URL: http://arxiv.org/abs/2302.03550v1
- Date: Tue, 7 Feb 2023 15:59:08 GMT
- Title: Convergence rates for momentum stochastic gradient descent with noise of
machine learning type
- Authors: Benjamin Gess, Sebastian Kassing
- Abstract summary: We consider the momentum of descent scheme (MSGD) and its continuous-in-time counterpart.
We show almost exponential convergence of objective function value for target functions.
- Score: 1.4213973379473654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the momentum stochastic gradient descent scheme (MSGD) and its
continuous-in-time counterpart in the context of non-convex optimization. We
show almost sure exponential convergence of the objective function value for
target functions that are Lipschitz continuous and satisfy the
Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on
the stochastic noise that are motivated by overparameterized supervised
learning applications. Moreover, we optimize the convergence rate over the set
of friction parameters and show that the MSGD process almost surely converges.
Related papers
- More Optimal Fractional-Order Stochastic Gradient Descent for Non-Convex Optimization Problems [2.5971517743176915]
We propose 2SED Fractional-Order Gradient Descent (2FOSGD), which integrates the Two-Scale Effective Dimension (2SED) with FOSGD.<n>By tracking sensitivity and effective dimensionality, 2SEDFOSGD dynamically modulates the exponent to mitigate sluggish oscillations and hasten convergence.
arXiv Detail & Related papers (2025-05-05T19:27:36Z) - Stochastic Optimization with Optimal Importance Sampling [49.484190237840714]
We propose an iterative-based algorithm that jointly updates the decision and the IS distribution without requiring time-scale separation between the two.
Our method achieves the lowest possible variable variance and guarantees global convergence under convexity of the objective and mild assumptions on the IS distribution family.
arXiv Detail & Related papers (2025-04-04T16:10:18Z) - Emergence of heavy tails in homogenized stochastic gradient descent [1.450405446885067]
Loss by gradient descent (SGD) leads to heavy-tailed network parameters.
We analyze a continuous diffusion approximation of SGD, called homogenized gradient descent.
We quantify the interplay between optimization parameters and the tail-index.
arXiv Detail & Related papers (2024-02-02T13:06:33Z) - Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Conservative SPDEs as fluctuating mean field limits of stochastic
gradient descent [1.2031796234206138]
It is shown that the inclusion of fluctuations in the limiting SPDE improves the rate of convergence, and retains information about the fluctuations of descent in the continuum limit.
arXiv Detail & Related papers (2022-07-12T17:27:18Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the emphStochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - Convergence and Stability of the Stochastic Proximal Point Algorithm
with Momentum [14.158845925610438]
We show how a gradient proximal algorithm with momentum (PPA) allows faster convergence to a neighborhood compared to the proximal algorithm (PPA) with better contraction factor.
arXiv Detail & Related papers (2021-11-11T12:17:22Z) - Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity that outperform state-of-the-art results in this case.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z) - Convergence rates and approximation results for SGD and its
continuous-time counterpart [16.70533901524849]
This paper proposes a thorough theoretical analysis of convex Gradient Descent (SGD) with non-increasing step sizes.
First, we show that the SGD can be provably approximated by solutions of inhomogeneous Differential Equation (SDE) using coupling.
Recent analyses of deterministic and optimization methods by their continuous counterpart, we study the long-time behavior of the continuous processes at hand and non-asymptotic bounds.
arXiv Detail & Related papers (2020-04-08T18:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.