Related papers: Towards Continuous-Time Approximations for Stochastic Gradient Descent without Replacement

Towards Continuous-Time Approximations for Stochastic Gradient Descent without Replacement

URL: http://arxiv.org/abs/2512.04703v1
Date: Thu, 04 Dec 2025 11:52:20 GMT
Title: Towards Continuous-Time Approximations for Stochastic Gradient Descent without Replacement
Authors: Stefan Perko,
Abstract summary: Gradient optimization algorithms using epochs, that is those based on descent without replacement (SGDo), are predominantly used to train machine learning models in practice.<n>We propose a continuous-time approximation to SGDo with additive noise based on a Young differential equation driven by a process we call an "epoched Brownian motion"<n>We show its usefulness by proving the almost sure convergence of the continuous-time approximation for strongly convex objectives and learning rate schedules of the form $u_t = frac1(+t), in (0,1)$.
Score: 1.7767466724342065
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Gradient optimization algorithms using epochs, that is those based on stochastic gradient descent without replacement (SGDo), are predominantly used to train machine learning models in practice. However, the mathematical theory of SGDo and related algorithms remain underexplored compared to their "with replacement" and "one-pass" counterparts. In this article, we propose a stochastic, continuous-time approximation to SGDo with additive noise based on a Young differential equation driven by a stochastic process we call an "epoched Brownian motion". We show its usefulness by proving the almost sure convergence of the continuous-time approximation for strongly convex objectives and learning rate schedules of the form $u_t = \frac{1}{(1+t)^β}, β\in (0,1)$. Moreover, we compute an upper bound on the asymptotic rate of almost sure convergence, which is as good or better than previous results for SGDo.

Related papers

Convergence of Momentum-Based Optimization Algorithms with Time-Varying Parameters [0.0]
We present a unified algorithm for optimization that makes use of a "momentum" term.<n>The gradient depends not only on the current true gradient of the objective function, but also on the true gradient at the previous iteration.
arXiv Detail & Related papers (2025-06-13T15:53:17Z)
Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent [14.19520637866741]
We establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing confidence sets.<n>We derive approximation rates in convex distance of order up to $1/sqrtn$.
arXiv Detail & Related papers (2025-02-10T17:49:05Z)
Limit Theorems for Stochastic Gradient Descent with Infinite Variance [51.4853131023238]
We show that the gradient descent algorithm can be characterized as the stationary distribution of a suitably defined Ornstein-rnstein process driven by an appropriate L'evy process.<n>We also explore the applications of these results in linear regression and logistic regression models.
arXiv Detail & Related papers (2024-10-21T09:39:10Z)
One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm. We show theoretically and by simulations that it is an interesting alternative to the usual gradient descent with averaging or the adaptative gradient descent.
arXiv Detail & Related papers (2023-06-09T13:43:07Z)
Convergence of Batch Stochastic Gradient Descent Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results [0.9900482274337404]
We study convex optimization using a very general formulation called BSGD (Block Gradient Descent) We establish conditions for BSGD to converge to the global minimum, based on approximation theory. We show that when approximate gradients are used, BSGD converges while momentum-based methods can diverge.
arXiv Detail & Related papers (2022-09-12T16:23:15Z)
Exponential Concentration in Stochastic Approximation [0.8192907805418583]
We analyze the behavior of approximation algorithms where iterates, in expectation, progress towards an objective at each step. We apply our results to several different Markov Approximation algorithms, specifically Projected Gradient Descent, Kiefer-Wolfowitz and Frank-Wolfe algorithms.
arXiv Detail & Related papers (2022-08-15T14:57:26Z)
Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $min_mathbfxmax_mathbfyF(mathbfx) + H(mathbfx,mathbfy)$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$. We present a emphaccelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragrad
arXiv Detail & Related papers (2022-06-17T06:10:20Z)
Utilising the CLT Structure in Stochastic Gradient based Sampling : Improved Analysis and Faster Algorithms [14.174806471635403]
We consider approximations of sampling algorithms, such as Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamcs (IPD) We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian. We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z)
On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
convex descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade. Some modified SGD-type algorithms, which outperform the SGD in many competitions and applications in terms of convergence rate and accuracy, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad) We focus on convergence analysis of mSGD and AdaGrad for any smooth (possibly non-possibly non-possibly non-possibly) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z)
Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective. We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non- Hilbert optimization using first-order algorithms for which the gradient estimates may have tails. We show that a combination of gradient, momentum, and normalized gradient descent convergence to critical points in high-probability with best-known iteration for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z)
ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms. We devise a novel, referred to as Recursive One-Over-T SGD, based on an easily implementable, averaging of past gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z)
Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients. We prove new complexity that outperform state-of-the-art results in this case. We derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.