Revisiting Gradient Clipping: Stochastic bias and tight convergence
guarantees
- URL: http://arxiv.org/abs/2305.01588v2
- Date: Thu, 9 Nov 2023 09:24:31 GMT
- Title: Revisiting Gradient Clipping: Stochastic bias and tight convergence
guarantees
- Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich
- Abstract summary: We give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$.
In particular, we show that for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence.
We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD.
- Score: 37.40957596986653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient clipping is a popular modification to standard (stochastic) gradient
descent, at every iteration limiting the gradient norm to a certain value $c
>0$. It is widely used for example for stabilizing the training of deep
learning models (Goodfellow et al., 2016), or for enforcing differential
privacy (Abadi et al., 2016). Despite popularity and simplicity of the clipping
mechanism, its convergence guarantees often require specific values of $c$ and
strong noise assumptions.
In this paper, we give convergence guarantees that show precise dependence on
arbitrary clipping thresholds $c$ and show that our guarantees are tight with
both deterministic and stochastic gradients. In particular, we show that (i)
for deterministic gradient descent, the clipping threshold only affects the
higher-order terms of convergence, (ii) in the stochastic setting convergence
to the true optimum cannot be guaranteed under the standard noise assumption,
even under arbitrary small step-sizes. We give matching upper and lower bounds
for convergence of the gradient norm when running clipped SGD, and illustrate
these results with experiments.
Related papers
- Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
We show that AdaGrad outperforms SGD by a factor of $d$ under certain conditions.
Motivated by this, we introduce assumptions on the smoothness structure of the objective and the gradient variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z) - Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
An analysis of convex and non- optimization methods often requires the Lipsitzness gradient, which limits the analysis by this trajectorys.
Recent work generalizes the gradient setting via the non-uniform smoothness condition.
arXiv Detail & Related papers (2023-06-02T04:21:59Z) - Nonlinear gradient mappings and stochastic optimization: A general
framework with applications to heavy-tail noise [11.768495184175052]
We introduce a general framework for nonlinear gradient descent scenarios when gradient noise exhibits heavy tails.
We show that for a nonlinearity with bounded outputs and for the gradient noise that may not have finite moments of order greater than one, the nonlinear SGD converges to zero at rate$O(/tzeta)$, $zeta in (0,1)$.
Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state of the art alternatives on real data sets
arXiv Detail & Related papers (2022-04-06T06:05:52Z) - The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded
Gradients and Affine Variance [46.15915820243487]
We show that AdaGrad-Norm exhibits an order optimal convergence of $mathcalOleft.
We show that AdaGrad-Norm exhibits an order optimal convergence of $mathcalOleft.
arXiv Detail & Related papers (2022-02-11T17:37:54Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non- Hilbert optimization using first-order algorithms for which the gradient estimates may have tails.
We show that a combination of gradient, momentum, and normalized gradient descent convergence to critical points in high-probability with best-known iteration for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Tight Second-Order Certificates for Randomized Smoothing [106.06908242424481]
We show that there also exists a universal curvature-like bound for Gaussian random smoothing.
In addition to proving the correctness of this novel certificate, we show that SoS certificates are realizable and therefore tight.
arXiv Detail & Related papers (2020-10-20T18:03:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.