Revisiting Gradient Clipping: Stochastic bias and tight convergence
guarantees
- URL: http://arxiv.org/abs/2305.01588v2
- Date: Thu, 9 Nov 2023 09:24:31 GMT
- Title: Revisiting Gradient Clipping: Stochastic bias and tight convergence
guarantees
- Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich
- Abstract summary: We give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$.
In particular, we show that for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence.
We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD.
- Score: 37.40957596986653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient clipping is a popular modification to standard (stochastic) gradient
descent that, at every iteration, limits the gradient norm to a certain value $c > 0$. It is
widely used, for example, for stabilizing the training of deep learning models (Goodfellow et
al., 2016) or for enforcing differential privacy (Abadi et al., 2016). Despite the popularity
and simplicity of the clipping mechanism, its convergence guarantees often require specific
values of $c$ and strong noise assumptions.
In this paper, we give convergence guarantees that show precise dependence on
arbitrary clipping thresholds $c$ and show that our guarantees are tight with
both deterministic and stochastic gradients. In particular, we show that (i)
for deterministic gradient descent, the clipping threshold only affects the
higher-order terms of convergence, (ii) in the stochastic setting convergence
to the true optimum cannot be guaranteed under the standard noise assumption,
even under arbitrarily small step-sizes. We give matching upper and lower bounds
for convergence of the gradient norm when running clipped SGD, and illustrate
these results with experiments.
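For concreteness, the following is a minimal NumPy sketch of the clipping operator and of clipped SGD as described in the abstract; the quadratic objective, step size, noise model, and threshold are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c (identity if already small)."""
    norm = np.linalg.norm(g)
    return g if norm <= c else (c / norm) * g

def clipped_sgd(grad_fn, x0, step_size, c, n_steps, noise_std, seed=0):
    """Clipped SGD: x_{t+1} = x_t - step_size * clip(g_t, c),
    where g_t is a noisy estimate of the true gradient at x_t."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        g = grad_fn(x) + noise_std * rng.standard_normal(x.shape)  # stochastic gradient
        x -= step_size * clip(g, c)
    return x

# Illustrative run on the quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x = clipped_sgd(grad_fn=lambda x: x, x0=np.full(10, 5.0),
                step_size=0.05, c=1.0, n_steps=2000, noise_std=1.0)
print("final gradient norm:", np.linalg.norm(x))
```

The stochastic bias discussed in result (ii) arises because clip() is applied to the noisy gradient g rather than to its expectation, so the expected clipped update can point away from the true gradient direction no matter how small step_size is chosen.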
Related papers
- Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates gradient normalization with and without clipping (NSGDC) and its variance-reduction variant (NSGDC-VR).
We present significant improvements in the theoretical results for both algorithms.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
We show that AdaGrad outperforms SGD by a factor of $d$ under certain conditions.
Motivated by this, we introduce assumptions on the smoothness structure of the objective and the gradient variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z) - Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
The analysis of convex and non-convex optimization methods often requires Lipschitz continuity of the gradient, which restricts the class of objectives the analysis covers.
Recent work generalizes this setting via a non-uniform smoothness condition.
arXiv Detail & Related papers (2023-06-02T04:21:59Z) - Nonlinear gradient mappings and stochastic optimization: A general
framework with applications to heavy-tail noise [11.768495184175052]
We introduce a general framework for nonlinear gradient descent scenarios when gradient noise exhibits heavy tails.
We show that, for a nonlinearity with bounded outputs and for gradient noise that may not have finite moments of order greater than one, the nonlinear SGD converges to zero at rate $O(1/t^{\zeta})$, $\zeta \in (0,1)$.
Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state-of-the-art alternatives on real data sets (a minimal sketch of this plug-in-nonlinearity view appears after this list).
arXiv Detail & Related papers (2022-04-06T06:05:52Z) - The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded
Gradients and Affine Variance [46.15915820243487]
We show that AdaGrad-Norm exhibits an order-optimal convergence rate of $\mathcal{O}(\mathrm{poly}\log(T)/\sqrt{T})$.
arXiv Detail & Related papers (2022-02-11T17:37:54Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points with high probability, with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Tight Second-Order Certificates for Randomized Smoothing [106.06908242424481]
We show that there also exists a universal curvature-like bound for Gaussian random smoothing.
In addition to proving the correctness of this novel certificate, we show that SoS certificates are realizable and therefore tight.
arXiv Detail & Related papers (2020-10-20T18:03:45Z)
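As a companion illustration for the "Nonlinear gradient mappings" entry above, here is a hedged sketch in which norm clipping, normalization, and the componentwise sign are treated as interchangeable bounded nonlinearities applied to the stochastic gradient before the SGD step; the quadratic objective and the Student-t noise model are assumptions chosen only to mimic heavy tails, not that paper's setup.

```python
import numpy as np

# Bounded nonlinearities that can be plugged into the SGD update.
def clip_to(c):
    def apply(g):
        norm = np.linalg.norm(g)
        return g if norm <= c else (c / norm) * g
    return apply

def normalize(g):
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g

def componentwise_sign(g):
    return np.sign(g)

def nonlinear_sgd(grad_fn, x0, nonlinearity, step_size, n_steps, seed=0):
    """Nonlinear SGD: x_{t+1} = x_t - step_size * N(g_t) for a bounded map N."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # Student-t noise with 2 degrees of freedom: heavy-tailed, infinite variance.
        g = grad_fn(x) + rng.standard_t(df=2, size=x.shape)
        x -= step_size * nonlinearity(g)
    return x

# Compare the three nonlinearities on f(x) = 0.5 * ||x||^2 (gradient = x).
for name, nl in [("clip(c=1)", clip_to(1.0)),
                 ("normalize", normalize),
                 ("sign", componentwise_sign)]:
    x = nonlinear_sgd(lambda x: x, x0=np.full(20, 5.0), nonlinearity=nl,
                      step_size=0.02, n_steps=3000, seed=1)
    print(f"{name:>10s}: final gradient norm = {np.linalg.norm(x):.3f}")
```

Norm clipping and normalization act jointly on the whole gradient vector, while the sign map acts componentwise; both families have bounded outputs, which is the property such frameworks exploit when the noise lacks finite higher-order moments.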
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.