Stability and Convergence of Stochastic Gradient Clipping: Beyond
Lipschitz Continuity and Smoothness
- URL: http://arxiv.org/abs/2102.06489v1
- Date: Fri, 12 Feb 2021 12:41:42 GMT
- Title: Stability and Convergence of Stochastic Gradient Clipping: Beyond
Lipschitz Continuity and Smoothness
- Authors: Vien V. Mai and Mikael Johansson
- Abstract summary: Gradient clipping is a technique to stabilize the training process for problems prone to the exploding gradient problem.
This paper establishes both qualitative and quantitative convergence results for the clipped stochastic (sub)gradient method (SGD) for non-smooth convex functions.
We also study the convergence of a clipped method with momentum, which includes clipped SGD as a special case.
- Score: 23.22461721824713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient algorithms are often unstable when applied to functions
that do not have Lipschitz-continuous and/or bounded gradients. Gradient
clipping is a simple and effective technique to stabilize the training process
for problems that are prone to the exploding gradient problem. Despite its
widespread popularity, the convergence properties of the gradient clipping
heuristic are poorly understood, especially for stochastic problems. This paper
establishes both qualitative and quantitative convergence results of the
clipped stochastic (sub)gradient method (SGD) for non-smooth convex functions
with rapidly growing subgradients. Our analyses show that clipping enhances the
stability of SGD and that the clipped SGD algorithm enjoys finite convergence
rates in many cases. We also study the convergence of a clipped method with
momentum, which includes clipped SGD as a special case, for weakly convex
problems under standard assumptions. With a novel Lyapunov analysis, we show
that the proposed method achieves the best-known rate for the considered class
of problems, demonstrating the effectiveness of clipped methods also in this
regime. Numerical results confirm our theoretical developments.
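As a concrete point of reference for the update rules the abstract refers to, the following is a minimal sketch of norm-based clipping applied to SGD and to a momentum variant. The clipping threshold tau, step size lr, momentum parameter beta, and the toy objective f(x) = |x|^3 / 3 (whose gradient grows quadratically and is not globally Lipschitz) are illustrative assumptions, not the exact scheme or constants analyzed in the paper.

```python
import numpy as np

def clip(g, tau):
    """Scale the (sub)gradient g so that its Euclidean norm never exceeds tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else (tau / norm) * g

def clipped_sgd_step(x, subgrad, lr, tau):
    """One clipped stochastic (sub)gradient step: x <- x - lr * clip(g, tau)."""
    return x - lr * clip(subgrad(x), tau)

def clipped_momentum_step(x, m, subgrad, lr, tau, beta=0.9):
    """Clipped step with a momentum buffer; beta = 0 recovers clipped SGD."""
    m = beta * m + (1.0 - beta) * clip(subgrad(x), tau)
    return x - lr * m, m

# Toy problem: f(x) = |x|^3 / 3 has gradient sign(x) * x^2, which grows rapidly
# away from the origin. Unclipped SGD with this step size can overshoot and
# diverge from a sufficiently distant start, while the clipped update stays stable.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subgrad = lambda x: np.sign(x) * x**2 + rng.normal(scale=0.1, size=x.shape)
    x, m = np.array([5.0]), np.zeros(1)
    for _ in range(500):
        x, m = clipped_momentum_step(x, m, subgrad, lr=0.05, tau=1.0)
    print(x)  # ends close to the minimizer at 0
```

The same clipping function can be dropped into any first-order update; the momentum form above is one common variant and is not necessarily the exact momentum scheme studied in the paper.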
Related papers
- Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models
via Reparameterisation and Smoothing [1.6114012813668932]
We introduce a simple framework to define non-differentiable functions piecewise and present a systematic approach to obtain smoothings.
Our main contribution is a novel variant of SGD, Diagonalisation Gradient Descent, which progressively enhances the accuracy of the smoothed approximation.
Our approach is simple, fast, and stable, and attains orders-of-magnitude reductions in work-normalised variance.
arXiv Detail & Related papers (2024-02-19T00:43:22Z)
- High Probability Analysis for Non-Convex Stochastic Optimization with
Clipping [13.025261730510847]
Gradient clipping is a technique for dealing with heavy-tailed noise when training neural networks.
Most theoretical guarantees only provide an in-expectation analysis of performance.
Our analysis provides a relatively complete picture for the theoretical guarantee of optimization algorithms with gradient clipping.
arXiv Detail & Related papers (2023-07-25T17:36:56Z)
- Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the
Bounded Gradient Assumption [11.367487348673793]
We prove that various stochastic gradient methods, including stochastic gradient descent (SGD), stochastic heavy-ball (SHB) and stochastic Nesterov's accelerated gradient (SNAG), almost surely avoid any strict saddle manifold.
This is the first time such results are obtained for SHB and SNAG methods.
arXiv Detail & Related papers (2023-02-15T18:53:41Z)
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
- Clipped Stochastic Methods for Variational Inequalities with
Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for stochastic methods solving monotone and structured non-monotone variational inequality problems (VIPs).
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA (stochastic extragradient and stochastic gradient descent-ascent).
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Stability and Generalization of Stochastic Gradient Methods for Minimax
Problems [71.60601421935844]
Many machine learning problems, such as training Generative Adversarial Networks (GANs), can be formulated as minimax problems.
We provide a comprehensive generalization analysis of stochastic gradient methods for minimax problems.
arXiv Detail & Related papers (2021-05-08T22:38:00Z)
- Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
arXiv Detail & Related papers (2020-10-05T14:36:59Z)
- Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first-ever-known stability and generalization bounds for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
- A frequency-domain analysis of inexact gradient methods [0.0]
We study robustness properties of some iterative gradient-based methods for strongly convex functions.
We derive improved analytic bounds for the convergence rate of Nesterov's accelerated method on strongly convex functions.
arXiv Detail & Related papers (2019-12-31T18:47:30Z)