Non Asymptotic Bounds for Optimization via Online Multiplicative
Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2112.07110v2
- Date: Wed, 15 Dec 2021 10:09:19 GMT
- Title: Non Asymptotic Bounds for Optimization via Online Multiplicative
Stochastic Gradient Descent
- Authors: Riddhiman Bhattacharya
- Abstract summary: The gradient noise of Gradient Descent (SGD) is considered to play a key role in its properties.
We show that noise classes that have the same mean and covariance structure of SGD via minibatching have similar properties.
We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The gradient noise of Stochastic Gradient Descent (SGD) is considered to play
a key role in its properties (e.g. escaping low potential points and
regularization). Past research has indicated that the covariance of the SGD
error done via minibatching plays a critical role in determining its
regularization and escape from low potential points. It is however not much
explored how much the distribution of the error influences the behavior of the
algorithm. Motivated by some new research in this area, we prove universality
results by showing that noise classes that have the same mean and covariance
structure of SGD via minibatching have similar properties. We mainly consider
the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm as introduced
by Wu et al., which has a much more general noise class than the SGD algorithm
done via minibatching. We establish nonasymptotic bounds for the M-SGD
algorithm mainly with respect to the Stochastic Differential Equation
corresponding to SGD via minibatching. We also show that the M-SGD error is
approximately a scaled Gaussian distribution with mean $0$ at any fixed point
of the M-SGD algorithm. We also establish bounds for the convergence of the
M-SGD algorithm in the strongly convex regime.
Related papers
- Efficiency Ordering of Stochastic Gradient Descent [9.634481296779057]
We consider the gradient descent (SGD) algorithm driven by a general sampling sequence, including i.i.i.d noise and random walk on an arbitrary graph.
We employ the notion of efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo samplers.
arXiv Detail & Related papers (2022-09-15T16:50:55Z) - Utilising the CLT Structure in Stochastic Gradient based Sampling :
Improved Analysis and Faster Algorithms [14.174806471635403]
We consider approximations of sampling algorithms, such as Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamcs (IPD)
We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian.
We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z) - Clipped Stochastic Methods for Variational Inequalities with
Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for methods for solving monotone and structured non-monotone VIPs.
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
arXiv Detail & Related papers (2022-06-02T15:21:55Z) - Computing the Variance of Shuffling Stochastic Gradient Algorithms via
Power Spectral Density Analysis [6.497816402045099]
Two common alternatives to gradient descent (SGD) with theoretical benefits are random reshuffling (SGDRR) and shuffle-once (SGD-SO)
We study the stationary variances of SGD, SGDRR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations.
arXiv Detail & Related papers (2022-06-01T17:08:04Z) - On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
convex descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade.
Some modified SGD-type algorithms, which outperform the SGD in many competitions and applications in terms of convergence rate and accuracy, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad)
We focus on convergence analysis of mSGD and AdaGrad for any smooth (possibly non-possibly non-possibly non-possibly) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z) - Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent:
Convergence Guarantees and Empirical Benefits [21.353189917487512]
gradient descent (SGD) and its variants have established themselves as the go-to algorithms for machine learning problems.
We take a step forward by proving minibatch SGD converges to a critical point of the full log-likelihood loss function.
Our theoretical guarantees hold provided that the kernel functions exhibit exponential or eigendecay.
arXiv Detail & Related papers (2021-11-19T22:28:47Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of gradient descent (SGD)
We show that SGD avoids saddle points/manifolds with $1$ for strict step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.