Convergence Rates of Stochastic Gradient Descent under Infinite Noise
Variance
- URL: http://arxiv.org/abs/2102.10346v1
- Date: Sat, 20 Feb 2021 13:45:11 GMT
- Title: Convergence Rates of Stochastic Gradient Descent under Infinite Noise
Variance
- Authors: Hongjian Wang, Mert G\"urb\"uzbalaban, Lingjiong Zhu, Umut
\c{S}im\c{s}ekli, Murat A. Erdogdu
- Abstract summary: Heavy tails emerge in gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
- Score: 14.06947898164194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have provided both empirical and theoretical evidence
illustrating that heavy tails can emerge in stochastic gradient descent (SGD)
in various scenarios. Such heavy tails potentially result in iterates with
diverging variance, which hinders the use of conventional convergence analysis
techniques that rely on the existence of the second-order moments. In this
paper, we provide convergence guarantees for SGD under a state-dependent and
heavy-tailed noise with a potentially infinite variance, for a class of
strongly convex objectives. In the case where the $p$-th moment of the noise
exists for some $p\in [1,2)$, we first identify a condition on the Hessian,
coined '$p$-positive (semi-)definiteness', that leads to an interesting
interpolation between positive semi-definite matrices ($p=2$) and diagonally
dominant matrices with non-negative diagonal entries ($p=1$). Under this
condition, we then provide a convergence rate for the distance to the global
optimum in $L^p$. Furthermore, we provide a generalized central limit theorem,
which shows that the properly scaled Polyak-Ruppert averaging converges weakly
to a multivariate $\alpha$-stable random vector. Our results indicate that even
under heavy-tailed noise with infinite variance, SGD can converge to the global
optimum without necessitating any modification neither to the loss function or
to the algorithm itself, as typically required in robust statistics. We
demonstrate the implications of our results to applications such as linear
regression and generalized linear models subject to heavy-tailed data.
Related papers
- A Unified Analysis for Finite Weight Averaging [50.75116992029417]
Averaging iterations of Gradient Descent (SGD) have achieved empirical success in training deep learning models, such as Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA)
In this paper, we generalize LAWA as Finite Weight Averaging (FWA) and explain their advantages compared to SGD from the perspective of optimization and generalization.
arXiv Detail & Related papers (2024-11-20T10:08:22Z) - Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees [56.80920351680438]
We study high-probability convergence in online learning, in the presence of heavy-tailed noise.
Compared to state-of-the-art, who only consider clipping and require noise with bounded $p$-th moments, we provide guarantees for a broad class of nonlinearities.
arXiv Detail & Related papers (2024-10-17T18:25:28Z) - Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation [22.652143194356864]
We address the problem of solving strongly convex and smooth problems using a descent gradient (SGD) algorithm with a constant step size.
We provide an expansion of the mean-squared error of the resulting estimator with respect to the number iterations of $n$.
We establish that this chain is geometrically ergodic with respect to a defined weighted Wasserstein semimetric.
arXiv Detail & Related papers (2024-10-07T15:02:48Z) - Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods [25.831462008050387]
The Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding.
It still remains unclear whether the lastiterate convergence can be provably extended to wider composite optimization and non-Euclidean norms.
arXiv Detail & Related papers (2023-12-13T21:41:06Z) - Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications [2.0584253077707477]
We study the convergence properties of the Gradient Descent (SGD) method for finding a stationary point of an objective function $J(cdot)$.
Our results apply to a class of invex'' functions, which have the property that every stationary point is also a global minimizer.
arXiv Detail & Related papers (2023-12-05T15:22:39Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped optimization problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $mathcalO(K-(alpha - 1)/alpha)$, when the gradients have finite moments of order.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise [59.25598762373543]
We show that wetailed high-prob convergence guarantees of learning on streaming data in the presence of heavy-tailed noise.
We demonstrate analytically and that $ta$ can be used to the preferred choice of setting for a given problem.
arXiv Detail & Related papers (2023-10-28T18:53:41Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds [12.699376765058137]
gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits perfectly inputs and outputs $langle theta_*, phi(X) rangle = Y$, where $phi(X)$ stands for a possibly infinite dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and
Non-Asymptotic Concentration [115.1954841020189]
We study the inequality and non-asymptotic properties of approximation procedures with Polyak-Ruppert averaging.
We prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity.
arXiv Detail & Related papers (2020-04-09T17:54:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.