Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping
- URL: http://arxiv.org/abs/2005.10785v2
- Date: Fri, 23 Oct 2020 11:00:08 GMT
- Title: Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping
- Authors: Eduard Gorbunov, Marina Danilova, Alexander Gasnikov
- Abstract summary: We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity bounds for the strongly convex case that outperform state-of-the-art results.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without a light-tails assumption on the noise.
- Score: 69.9674326582747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new accelerated stochastic first-order method
called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed
distributed noise in stochastic gradients and derive the first high-probability
complexity bounds for this method, closing the gap in the theory of stochastic
optimization with heavy-tailed noise. Our method is based on a special variant
of accelerated Stochastic Gradient Descent (SGD) and clipping of stochastic
gradients. We extend our method to the strongly convex case and prove new
complexity bounds that outperform state-of-the-art results in this case.
Finally, we extend our proof technique and derive the first non-trivial
high-probability complexity bounds for SGD with clipping without a light-tails
assumption on the noise.
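The abstract combines an accelerated SGD scheme with clipping of the stochastic gradients. As a minimal sketch of the clipping ingredient only, assuming a generic stochastic gradient oracle and illustrative step-size and clipping-level choices, the snippet below runs plain clipped SGD on a toy quadratic with heavy-tailed (Student-t) gradient noise; it does not reproduce the accelerated SSTM momentum sequence or the step-size schedule analyzed in the paper.

```python
import numpy as np

def clip(g, lam):
    # Standard clipping operator: clip(g, lam) = min(1, lam / ||g||) * g
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clipped_sgd(grad_oracle, x0, stepsize, clip_level, n_iters):
    # Plain SGD with clipped stochastic gradients (illustrative baseline,
    # not the accelerated clipped-SSTM update from the paper).
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_oracle(x)                      # stochastic gradient with heavy-tailed noise
        x = x - stepsize * clip(g, clip_level)  # update bounded in norm by stepsize * clip_level
    return x

# Toy usage: minimize f(x) = ||x||^2 with heavy-tailed Student-t gradient noise
# (df=3: finite variance but heavier tails than Gaussian).
rng = np.random.default_rng(0)
oracle = lambda x: 2.0 * x + rng.standard_t(df=3, size=x.shape)
print(clipped_sgd(oracle, x0=np.ones(5), stepsize=0.05, clip_level=2.0, n_iters=2000))
```

Each update is bounded in norm by stepsize * clip_level, which is the mechanism that yields high-probability guarantees without a light-tails (sub-Gaussian) assumption on the noise; the accelerated SSTM iteration in the paper adds momentum on top of this to improve the resulting complexity bounds.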
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Variance Reduction and Low Sample Complexity in Stochastic Optimization
via Proximal Point Method [5.025654873456757]
The paper establishes a low sample complexity for obtaining a high-probability guarantee on the convergence of the proposed method.
A subroutine is developed to solve the proximal subproblem, which also serves as a novel technique for variance reduction.
arXiv Detail & Related papers (2024-02-14T07:34:22Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped methods for stochastic optimization problems with heavy-tailed noise with structured density.
We show that it is possible to obtain faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise [96.80184504268593]
Gradient clipping is one of the key algorithmic ingredients used to derive good high-probability guarantees.
However, clipping can spoil the convergence of popular methods for composite and distributed optimization.
arXiv Detail & Related papers (2023-10-03T07:49:17Z) - First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities [91.46841922915418]
We present a unified approach to the theoretical analysis of first-order methods for optimization and variational inequalities under Markovian noise.
Our approach covers both non-convex and strongly convex minimization problems.
We provide bounds that match the oracle complexity in the case of strongly convex optimization problems.
arXiv Detail & Related papers (2023-05-25T11:11:31Z) - Utilising the CLT Structure in Stochastic Gradient based Sampling :
Improved Analysis and Faster Algorithms [14.174806471635403]
We consider stochastic approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamics (IPD).
We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian.
We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z) - Computing the Variance of Shuffling Stochastic Gradient Algorithms via
Power Spectral Density Analysis [6.497816402045099]
Two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO).
We study the stationary variances of SGD, SGD-RR, and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations.
arXiv Detail & Related papers (2022-06-01T17:08:04Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide a small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on the confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.