Computing the Variance of Shuffling Stochastic Gradient Algorithms via
Power Spectral Density Analysis
- URL: http://arxiv.org/abs/2206.00632v1
- Date: Wed, 1 Jun 2022 17:08:04 GMT
- Authors: Carles Domingo-Enrich
- Abstract summary: Two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO).
We study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations.
- Score: 6.497816402045099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When solving finite-sum minimization problems, two common alternatives to
stochastic gradient descent (SGD) with theoretical benefits are random
reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled
in cycles without replacement. Under a convenient stochastic noise
approximation which holds experimentally, we study the stationary variances of
the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this
order, and obtain simple approximations. To obtain our results, we study the
power spectral density of the stochastic gradient noise sequences. Our analysis
extends beyond SGD to SGD with momentum and to the stochastic Nesterov's
accelerated gradient method. We perform experiments on quadratic objective
functions to test the validity of our approximation and the correctness of our
findings.
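The variance ordering described in the abstract can be illustrated with a toy simulation. The sketch below is my own illustration, not the paper's power-spectral-density analysis: it runs the three sampling schemes on a one-dimensional quadratic finite-sum objective and compares the empirical stationary variances of the iterates. All names and hyperparameters are illustrative assumptions.

```python
# Toy comparison of stationary iterate variance for SGD, SGD-RR, and SGD-SO
# on f(x) = (1/n) * sum_i (x - b_i)^2 / 2 (a sketch, not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
n, lr, n_epochs = 10, 0.05, 2000
b = rng.normal(size=n)   # per-function minimizers; f_i(x) = (x - b_i)^2 / 2
b -= b.mean()            # center so the global minimizer is x = 0

def stationary_variance(sampler):
    """Run SGD with the given index sampler; return iterate variance after burn-in."""
    x, iterates = 0.0, []
    for epoch in range(n_epochs):
        for i in sampler(epoch):
            x -= lr * (x - b[i])   # stochastic gradient step on f_i
            iterates.append(x)
    burn = len(iterates) // 2      # discard the transient phase
    return np.var(iterates[burn:])

perm_once = rng.permutation(n)     # one fixed permutation for shuffle-once
samplers = {
    "SGD":    lambda e: rng.integers(0, n, size=n),  # sample with replacement
    "SGD-RR": lambda e: rng.permutation(n),          # reshuffle every epoch
    "SGD-SO": lambda e: perm_once,                   # same order every epoch
}
variances = {name: stationary_variance(s) for name, s in samplers.items()}
print(variances)
```

With a small step size, the without-replacement schemes exhibit markedly smaller stationary variance than plain SGD, consistent with the ordering the paper derives.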
Related papers
- Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates Gradient Normalization with(out) Clipping (NSGDC) and its variance-reduction variant (NSGDC-VR).
We present significant improvements in the theoretical results for both algorithms.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization [0.6906005491572401]
We show that noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function.
We analyze a new graduated optimization algorithm that varies the degree of smoothing via the learning rate and batch size.
arXiv Detail & Related papers (2023-11-15T07:27:40Z) - Doubly Stochastic Models: Learning with Unbiased Label Noises and
Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of gradient descent.
We find such implicit regularizer would favor some convergence points that could stabilize model outputs against perturbation of parameters.
Our work does not assume SGD to be an Ornstein-Uhlenbeck-like process, and achieves a more general result with convergence of the approximation proved.
arXiv Detail & Related papers (2023-04-01T14:09:07Z) - Utilising the CLT Structure in Stochastic Gradient based Sampling :
Improved Analysis and Faster Algorithms [14.174806471635403]
We consider stochastic approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamics (IPD).
We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian.
We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Noise and Fluctuation of Finite Learning Rate Stochastic Gradient
Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity bounds that outperform state-of-the-art results in this case.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without the light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z) - Convergence rates and approximation results for SGD and its
continuous-time counterpart [16.70533901524849]
This paper proposes a thorough theoretical analysis of convex Stochastic Gradient Descent (SGD) with non-increasing step sizes.
First, we show that SGD can be provably approximated by solutions of an inhomogeneous Stochastic Differential Equation (SDE) using coupling.
Following recent analyses of deterministic and stochastic optimization methods via their continuous-time counterparts, we study the long-time behavior of the continuous processes at hand and obtain non-asymptotic bounds.
arXiv Detail & Related papers (2020-04-08T18:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.