Understanding the unstable convergence of gradient descent
- URL: http://arxiv.org/abs/2204.01050v1
- Date: Sun, 3 Apr 2022 11:10:17 GMT
- Title: Understanding the unstable convergence of gradient descent
- Authors: Kwangjun Ahn, Jingzhao Zhang, Suvrit Sra
- Abstract summary: In machine learning applications, step sizes often do not fulfill the condition that, for an $L$-smooth cost, the step size is less than $2/L$.
We investigate this unstable convergence phenomenon from first principles, and elucidate key causes behind it.
We also identify its main characteristics, and how they interrelate, offering a transparent view backed by both theory and experiments.
- Score: 51.40523554349091
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing analyses of (stochastic) gradient descent rely on the condition
that for $L$-smooth cost, the step size is less than $2/L$. However, many works
have observed that in machine learning applications step sizes often do not
fulfill this condition, yet (stochastic) gradient descent converges, albeit in
an unstable manner. We investigate this unstable convergence phenomenon from
first principles, and elucidate key causes behind it. We also identify its main
characteristics, and how they interrelate, offering a transparent view backed
by both theory and experiments.
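As a quick illustration of why $2/L$ is the stability threshold (a minimal sketch of my own, not code from the paper): on a one-dimensional $L$-smooth quadratic, each gradient descent step multiplies the iterate by $1 - \eta L$, so the iterates contract for $\eta < 2/L$ and blow up for $\eta > 2/L$.

```python
# Minimal sketch (not from the paper): gradient descent on the L-smooth
# quadratic f(x) = 0.5 * L * x**2, whose gradient is L * x.
L = 10.0

def run_gd(step_size, x0=1.0, iters=100):
    x = x0
    for _ in range(iters):
        x = x - step_size * L * x   # x_{t+1} = (1 - step_size * L) * x_t
    return x

print(run_gd(step_size=0.19))  # below 2/L = 0.2: contraction factor |1 - 1.9| = 0.9
print(run_gd(step_size=0.21))  # above 2/L: factor |1 - 2.1| = 1.1, iterates blow up
```

On a quadratic the threshold is exact; the abstract's point is that in neural-network training the analogous condition is routinely violated, yet (stochastic) gradient descent still converges, albeit unstably.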
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a large fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
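A toy version of this setting (my own sketch, not the paper's construction): a depth-2 linear model $ab$ fit to a target of 1 with squared loss. The local curvature depends on the parameters, so a fixed step size that is too large never settles.

```python
# Toy sketch (mine, not the paper's construction): depth-2 linear model a*b
# fit to target 1 with squared loss 0.5 * (a*b - 1)**2.

def gd(step_size, a=3.0, b=0.1, iters=100):
    for _ in range(iters):
        r = a * b - 1.0                                      # residual
        a, b = a - step_size * r * b, b - step_size * r * a  # GD on both factors
    return a * b

print(gd(step_size=0.01))  # small step size: a*b approaches the target 1
print(gd(step_size=0.5))   # large fixed step size: oscillates and blows up (inf/nan)
```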
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
An analysis of convex and non-convex optimization methods often requires Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics.
Recent work generalizes this setting via a non-uniform smoothness condition.
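For background (my gloss on which condition is meant, not a quotation from the paper): a widely used non-uniform smoothness condition of this kind is $(L_0, L_1)$-smoothness, which lets the curvature grow with the gradient norm.

```latex
% (L_0, L_1)-smoothness: the Hessian norm may grow with the gradient norm,
% rather than being bounded by a single constant L.
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \|\nabla f(x)\|.
\]
% Classical L-smoothness is recovered with L_1 = 0 and L_0 = L.
```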
arXiv Detail & Related papers (2023-06-02T04:21:59Z) - Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the
Bounded Gradient Assumption [11.367487348673793]
We prove that various stochastic gradient methods, including stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov's accelerated gradient (SNAG) methods, almost surely avoid any strict saddle manifold.
This is the first time such results are obtained for SHB and SNAG methods.
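For reference, the single-step updates of the three methods named above, in generic textbook form (my notation, not code from the paper); `grad_fn(x)` is assumed to return a stochastic gradient estimate at `x`.

```python
# Generic textbook updates for SGD, SHB, and SNAG (my notation, not the paper's).

def sgd_step(x, grad_fn, lr):
    return x - lr * grad_fn(x)

def shb_step(x, x_prev, grad_fn, lr, beta):
    # stochastic heavy ball: gradient step plus momentum beta * (x - x_prev)
    return x - lr * grad_fn(x) + beta * (x - x_prev)

def snag_step(x, x_prev, grad_fn, lr, beta):
    # stochastic Nesterov acceleration: gradient taken at a look-ahead point
    y = x + beta * (x - x_prev)
    return y - lr * grad_fn(y)
```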
arXiv Detail & Related papers (2023-02-15T18:53:41Z) - Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of
Stability [40.17821914923602]
We show that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$.
Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training.
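Here the sharpness $S(\theta)$ is the largest eigenvalue of the loss Hessian at $\theta$. A minimal sketch of my own (not the paper's code) that estimates it by power iteration on a toy explicit Hessian and checks the stated constraint; for real networks one would use Hessian-vector products instead.

```python
import numpy as np

# Estimate S(theta) = lambda_max(Hessian) by power iteration on a toy
# explicit Hessian, then check the edge-of-stability constraint S <= 2/eta.

def sharpness(hessian, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(hessian.shape[0])
    for _ in range(iters):
        v = hessian @ v
        v /= np.linalg.norm(v)
    return float(v @ hessian @ v)      # Rayleigh quotient ~ top eigenvalue

eta = 0.01
H = np.diag([150.0, 30.0, 1.0])        # stand-in Hessian at the current theta
S = sharpness(H)
print(S, 2 / eta, S <= 2 / eta)        # ~150.0 200.0 True: below the 2/eta bound
```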
arXiv Detail & Related papers (2022-09-30T17:15:12Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
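One concrete way to read "two-step gradient updates": iterate the composition $x \mapsto \phi(\phi(x))$ of two gradient descent steps $\phi(x) = x - \eta f'(x)$, whose stable fixed points can be period-2 orbits of plain GD. A 1-D sketch with a loss of my own choosing (not the paper's example):

```python
# 1-D sketch (my own choice of loss, not the paper's example): GD on
# f(x) = (x**2 - 1)**2 / 4 with step size eta just above 2 / f''(1) = 1.
# The minimiser x = 1 is unstable for a single GD step (|phi'(1)| = 1.1 > 1),
# but the two-step map has a stable fixed point: GD settles into a
# period-2 oscillation around the minimum instead of converging to it.

def phi(x, eta):
    return x - eta * (x**3 - x)    # one GD step, since f'(x) = x**3 - x

eta, x = 1.05, 1.2
for _ in range(200):
    x = phi(phi(x, eta), eta)      # iterate the two-step map
print(x, phi(x, eta))              # ~1.09 and ~0.87: a stable 2-cycle around x = 1
```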
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - Stability and Generalization of Stochastic Gradient Methods for Minimax
Problems [71.60601421935844]
Many machine learning problems can be formulated as minimax problems, such as Generative Adversarial Networks (GANs).
We provide a comprehensive generalization analysis of stochastic gradient methods for minimax problems.
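For context, the basic method analysed in this line of work is simultaneous (stochastic) gradient descent-ascent on $\min_w \max_v f(w, v)$; a generic sketch of my own, not code from the paper.

```python
# Simultaneous (stochastic) gradient descent-ascent, in generic form.
# grad_w and grad_v are assumed to return (stochastic) gradient estimates.

def sgda(w, v, grad_w, grad_v, lr, steps):
    for _ in range(steps):
        gw, gv = grad_w(w, v), grad_v(w, v)
        w = w - lr * gw     # descent step for the min player
        v = v + lr * gv     # ascent step for the max player
    return w, v

# Example: the bilinear objective f(w, v) = w * v with saddle point at (0, 0).
w, v = sgda(1.0, 1.0, lambda w, v: v, lambda w, v: w, lr=0.1, steps=100)
print(w, v)   # the iterates spiral outward: a known failure mode of plain GDA
```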
arXiv Detail & Related papers (2021-05-08T22:38:00Z) - Analytic Signal Phase in $N-D$ by Linear Symmetry Tensor--fingerprint
modeling [69.35569554213679]
We show that the Analytic Signal phase and its gradient have a hitherto unstudied discontinuity in 2-D and higher dimensions.
This shortcoming can result in severe artifacts, whereas the problem does not exist in 1-D signals.
We suggest the use of the Linear Symmetry phase, which relies on more than one set of Gabor filters but adds only a negligible computational overhead.
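For background, the 1-D analytic signal phase (the case where, per the summary above, the discontinuity problem does not arise) can be computed with the Hilbert transform. A small sketch of my own, not the paper's N-D Linear Symmetry method.

```python
import numpy as np
from scipy.signal import hilbert

# 1-D analytic signal phase via the Hilbert transform (background sketch only).
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
x = np.cos(2 * np.pi * 25 * t)           # a pure oscillation at 25 cycles/unit
analytic = hilbert(x)                    # x + i * HilbertTransform(x)
phase = np.unwrap(np.angle(analytic))    # instantaneous phase, unwrapped
print(phase[:5])                         # grows linearly at ~2*pi*25 rad per unit
```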
arXiv Detail & Related papers (2020-05-16T21:17:26Z) - Explore Aggressively, Update Conservatively: Stochastic Extragradient
Methods with Variable Stepsize Scaling [34.35013145885164]
Extragradient methods have become a staple for solving large-scale saddlepoint problems in machine learning.
We show in this paper that running vanilla extragradient with stochastic gradients may jeopardize its convergence, even in simple bilinear models.
We show that a variable step-size scaling, in which the exploration step uses a more aggressive step size than the update step, allows the method to converge even with stochastic gradients, and we derive sharp convergence rates under an error bound condition.
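A minimal extragradient sketch on a bilinear saddle problem (my own illustration, not code from the paper), with separate exploration and update step sizes in the spirit of the title's variable step-size scaling.

```python
# Extragradient on the bilinear saddle problem f(w, v) = w * v.
# gamma is the exploration (extrapolation) step size, eta the update step size.

def extragradient(w, v, eta, gamma, steps):
    for _ in range(steps):
        w_half = w - gamma * v          # exploration step for the min player
        v_half = v + gamma * w          # exploration step for the max player
        w, v = w - eta * v_half, v + eta * w_half   # update from the original point
    return w, v

print(extragradient(1.0, 1.0, eta=0.1, gamma=0.1, steps=1000))
# -> close to (0, 0); plain GDA spirals outward here, extragradient converges
```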
arXiv Detail & Related papers (2020-03-23T10:24:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.