Understanding the unstable convergence of gradient descent
- URL: http://arxiv.org/abs/2204.01050v1
- Date: Sun, 3 Apr 2022 11:10:17 GMT
- Title: Understanding the unstable convergence of gradient descent
- Authors: Kwangjun Ahn, Jingzhao Zhang, Suvrit Sra
- Abstract summary: In machine learning applications, step sizes often do not fulfill the condition that, for an $L$-smooth cost, the step size is less than $2/L$.
We investigate this unstable convergence phenomenon from first principles, and elucidate key causes behind it.
We also identify its main characteristics, and how they interrelate, offering a transparent view backed by both theory and experiments.
- Score: 51.40523554349091
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing analyses of (stochastic) gradient descent rely on the condition
that for $L$-smooth cost, the step size is less than $2/L$. However, many works
have observed that in machine learning applications step sizes often do not
fulfill this condition, yet (stochastic) gradient descent converges, albeit in
an unstable manner. We investigate this unstable convergence phenomenon from
first principles, and elucidate key causes behind it. We also identify its main
characteristics, and how they interrelate, offering a transparent view backed
by both theory and experiments.
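As a quick illustration of why $2/L$ is the stability threshold (a minimal sketch of my own, not code from the paper): on a one-dimensional $L$-smooth quadratic, each gradient descent step multiplies the iterate by $1 - \eta L$, so the iterates contract for $\eta < 2/L$ and blow up for $\eta > 2/L$.

```python
# Minimal sketch (not from the paper): gradient descent on the L-smooth
# quadratic f(x) = 0.5 * L * x**2, whose gradient is L * x.
L = 10.0

def run_gd(step_size, x0=1.0, iters=100):
    x = x0
    for _ in range(iters):
        x = x - step_size * L * x   # x_{t+1} = (1 - step_size * L) * x_t
    return x

print(run_gd(step_size=0.19))  # below 2/L = 0.2: contraction factor |1 - 1.9| = 0.9
print(run_gd(step_size=0.21))  # above 2/L: factor |1 - 2.1| = 1.1, iterates blow up
```

On a quadratic the threshold is exact; the abstract's point is that in neural-network training the analogous condition is routinely violated, yet (stochastic) gradient descent still converges, albeit unstably.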
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a large fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
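A toy version of this setting (my own sketch, not the paper's construction): a depth-2 linear model $ab$ fit to a target of 1 with squared loss. The local curvature depends on the parameters, so a fixed step size that is too large never settles.

```python
# Toy sketch (mine, not the paper's construction): depth-2 linear model a*b
# fit to target 1 with squared loss 0.5 * (a*b - 1)**2.

def gd(step_size, a=3.0, b=0.1, iters=100):
    for _ in range(iters):
        r = a * b - 1.0                                      # residual
        a, b = a - step_size * r * b, b - step_size * r * a  # GD on both factors
    return a * b

print(gd(step_size=0.01))  # small step size: a*b approaches the target 1
print(gd(step_size=0.5))   # large fixed step size: oscillates and blows up (inf/nan)
```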
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
An analysis of convex and non-convex optimization methods often requires Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics.
Recent work generalizes this setting via a non-uniform smoothness condition.
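For background (my gloss on which condition is meant, not a quotation from the paper): a widely used non-uniform smoothness condition of this kind is $(L_0, L_1)$-smoothness, which lets the curvature grow with the gradient norm.

```latex
% (L_0, L_1)-smoothness: the Hessian norm may grow with the gradient norm,
% rather than being bounded by a single constant L.
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \|\nabla f(x)\|.
\]
% Classical L-smoothness is recovered with L_1 = 0 and L_0 = L.
```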
arXiv Detail & Related papers (2023-06-02T04:21:59Z) - Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the
Bounded Gradient Assumption [11.367487348673793]
We prove that various stochastic gradient methods, including stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov's accelerated gradient (SNAG) methods, almost surely avoid any strict saddle manifold.
This is the first time such results are obtained for SHB and SNAG methods.
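For reference, the single-step updates of the three methods named above, in generic textbook form (my notation, not code from the paper); `grad_fn(x)` is assumed to return a stochastic gradient estimate at `x`.

```python
# Generic textbook updates for SGD, SHB, and SNAG (my notation, not the paper's).

def sgd_step(x, grad_fn, lr):
    return x - lr * grad_fn(x)

def shb_step(x, x_prev, grad_fn, lr, beta):
    # stochastic heavy ball: gradient step plus momentum beta * (x - x_prev)
    return x - lr * grad_fn(x) + beta * (x - x_prev)

def snag_step(x, x_prev, grad_fn, lr, beta):
    # stochastic Nesterov acceleration: gradient taken at a look-ahead point
    y = x + beta * (x - x_prev)
    return y - lr * grad_fn(y)
```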
arXiv Detail & Related papers (2023-02-15T18:53:41Z) - Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of
Stability [40.17821914923602]
We show that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$.
Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training.
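Here the sharpness $S(\theta)$ is the largest eigenvalue of the loss Hessian at $\theta$. A minimal sketch of my own (not the paper's code) that estimates it by power iteration on a toy explicit Hessian and checks the stated constraint; for real networks one would use Hessian-vector products instead.

```python
import numpy as np

# Estimate S(theta) = lambda_max(Hessian) by power iteration on a toy
# explicit Hessian, then check the edge-of-stability constraint S <= 2/eta.

def sharpness(hessian, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(hessian.shape[0])
    for _ in range(iters):
        v = hessian @ v
        v /= np.linalg.norm(v)
    return float(v @ hessian @ v)      # Rayleigh quotient ~ top eigenvalue

eta = 0.01
H = np.diag([150.0, 30.0, 1.0])        # stand-in Hessian at the current theta
S = sharpness(H)
print(S, 2 / eta, S <= 2 / eta)        # ~150.0 200.0 True: below the 2/eta bound
```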
arXiv Detail & Related papers (2022-09-30T17:15:12Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
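One concrete way to read "two-step gradient updates": iterate the composition $x \mapsto \phi(\phi(x))$ of two gradient descent steps $\phi(x) = x - \eta f'(x)$, whose stable fixed points can be period-2 orbits of plain GD. A 1-D sketch with a loss of my own choosing (not the paper's example):

```python
# 1-D sketch (my own choice of loss, not the paper's example): GD on
# f(x) = (x**2 - 1)**2 / 4 with step size eta just above 2 / f''(1) = 1.
# The minimiser x = 1 is unstable for a single GD step (|phi'(1)| = 1.1 > 1),
# but the two-step map has a stable fixed point: GD settles into a
# period-2 oscillation around the minimum instead of converging to it.

def phi(x, eta):
    return x - eta * (x**3 - x)    # one GD step, since f'(x) = x**3 - x

eta, x = 1.05, 1.2
for _ in range(200):
    x = phi(phi(x, eta), eta)      # iterate the two-step map
print(x, phi(x, eta))              # ~1.09 and ~0.87: a stable 2-cycle around x = 1
```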
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - Stability and Generalization of Stochastic Gradient Methods for Minimax
Problems [71.60601421935844]
Many machine learning problems can be formulated as minimax problems, such as Generative Adversarial Networks (GANs).
We provide a comprehensive generalization analysis of stochastic gradient methods for minimax problems.
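For context, the basic method analysed in this line of work is simultaneous (stochastic) gradient descent-ascent on $\min_w \max_v f(w, v)$; a generic sketch of my own, not code from the paper.

```python
# Simultaneous (stochastic) gradient descent-ascent, in generic form.
# grad_w and grad_v are assumed to return (stochastic) gradient estimates.

def sgda(w, v, grad_w, grad_v, lr, steps):
    for _ in range(steps):
        gw, gv = grad_w(w, v), grad_v(w, v)
        w = w - lr * gw     # descent step for the min player
        v = v + lr * gv     # ascent step for the max player
    return w, v

# Example: the bilinear objective f(w, v) = w * v with saddle point at (0, 0).
w, v = sgda(1.0, 1.0, lambda w, v: v, lambda w, v: w, lr=0.1, steps=100)
print(w, v)   # the iterates spiral outward: a known failure mode of plain GDA
```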
arXiv Detail & Related papers (2021-05-08T22:38:00Z) - Analytic Signal Phase in $N-D$ by Linear Symmetry Tensor--fingerprint
modeling [69.35569554213679]
We show that the Analytic Signal phase and its gradient have a hitherto unstudied discontinuity in 2-D and higher dimensions.
This shortcoming can result in severe artifacts, whereas the problem does not exist in 1-D signals.
We suggest the use of the Linear Symmetry phase, which relies on more than one set of Gabor filters but adds only a negligible computational overhead.
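For background, the 1-D analytic signal phase (the case where, per the summary above, the discontinuity problem does not arise) can be computed with the Hilbert transform. A small sketch of my own, not the paper's N-D Linear Symmetry method.

```python
import numpy as np
from scipy.signal import hilbert

# 1-D analytic signal phase via the Hilbert transform (background sketch only).
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
x = np.cos(2 * np.pi * 25 * t)           # a pure oscillation at 25 cycles/unit
analytic = hilbert(x)                    # x + i * HilbertTransform(x)
phase = np.unwrap(np.angle(analytic))    # instantaneous phase, unwrapped
print(phase[:5])                         # grows linearly at ~2*pi*25 rad per unit
```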
arXiv Detail & Related papers (2020-05-16T21:17:26Z) - Explore Aggressively, Update Conservatively: Stochastic Extragradient
Methods with Variable Stepsize Scaling [34.35013145885164]
Extragradient methods have become a staple for solving large-scale saddlepoint problems in machine learning.
We show in this paper that running vanilla extragradient with stochastic gradients may jeopardize its convergence, even in simple bilinear models.
We show that a variable step-size scaling, in which the exploration step uses a more aggressive step size than the update step, allows the method to converge even with stochastic gradients, and we derive sharp convergence rates under an error bound condition.
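A minimal extragradient sketch on a bilinear saddle problem (my own illustration, not code from the paper), with separate exploration and update step sizes in the spirit of the title's variable step-size scaling.

```python
# Extragradient on the bilinear saddle problem f(w, v) = w * v.
# gamma is the exploration (extrapolation) step size, eta the update step size.

def extragradient(w, v, eta, gamma, steps):
    for _ in range(steps):
        w_half = w - gamma * v          # exploration step for the min player
        v_half = v + gamma * w          # exploration step for the max player
        w, v = w - eta * v_half, v + eta * w_half   # update from the original point
    return w, v

print(extragradient(1.0, 1.0, eta=0.1, gamma=0.1, steps=1000))
# -> close to (0, 0); plain GDA spirals outward here, extragradient converges
```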
arXiv Detail & Related papers (2020-03-23T10:24:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.