Beyond the Edge of Stability via Two-step Gradient Updates
- URL: http://arxiv.org/abs/2206.04172v3
- Date: Wed, 26 Jul 2023 10:48:54 GMT
- Title: Beyond the Edge of Stability via Two-step Gradient Updates
- Authors: Lei Chen, Joan Bruna
- Abstract summary: Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
- Score: 49.03389279816152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient Descent (GD) is a powerful workhorse of modern machine learning
thanks to its scalability and efficiency in high-dimensional spaces. Its
ability to find local minimisers is only guaranteed for losses with Lipschitz
gradients, where it can be seen as a `bona-fide' discretisation of an
underlying gradient flow. Yet, many ML setups involving overparametrised models
do not fall into this problem class, which has motivated research beyond the
so-called ``Edge of Stability'' (EoS), where the step-size crosses the
admissibility threshold inversely proportional to the Lipschitz constant above.
Perhaps surprisingly, GD has been empirically observed to still converge
regardless of local instability and oscillatory behavior.
The incipient theoretical analysis of this phenomena has mainly focused in
the overparametrised regime, where the effect of choosing a large learning rate
may be associated to a `Sharpness-Minimisation' implicit regularisation within
the manifold of minimisers, under appropriate asymptotic limits. In contrast,
in this work we directly examine the conditions for such unstable convergence,
focusing on simple, yet representative, learning problems, via analysis of
two-step gradient updates. Specifically, we characterize a local condition
involving third-order derivatives that guarantees existence and convergence to
fixed points of the two-step updates, and leverage such property in a
teacher-student setting, under population loss. Finally, starting from Matrix
Factorization, we provide observations of period-2 orbit of GD in
high-dimensional settings with intuition of its dynamics, along with
exploration into more general settings.
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Provably Accelerating Ill-Conditioned Low-rank Estimation via Scaled
Gradient Descent, Even with Overparameterization [48.65416821017865]
This chapter introduces a new algorithmic approach, dubbed scaled gradient (ScaledGD)
It converges linearly at a constant rate independent of the condition number of the low-rank object.
It maintains the low periteration cost of gradient descent for a variety of tasks.
arXiv Detail & Related papers (2023-10-09T21:16:57Z) - Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS)
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We show restrained numerical instabilities in current training practices of deep networks with gradient descent (SGD)
We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs)
We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization resulting in a stabilizing effect.
arXiv Detail & Related papers (2022-06-04T14:54:05Z) - A Local Convergence Theory for the Stochastic Gradient Descent Method in
Non-Convex Optimization With Non-isolated Local Minima [0.0]
Non-isolated minima presents a unique challenge that has remained under-explored.
In this paper, we study the local convergence of the gradient descent method to non-isolated global minima.
arXiv Detail & Related papers (2022-03-21T13:33:37Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To our best knowledge, this gives the firstever-known stability and generalization for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z) - Convergence and sample complexity of gradient methods for the model-free
linear quadratic regulator problem [27.09339991866556]
We show that ODE searches for optimal control for an unknown computation system by directly searching over the corresponding space of controllers.
We take a step towards demystifying the performance and efficiency of such methods by focusing on the gradient-flow dynamics set of stabilizing feedback gains and a similar result holds for the forward disctization of the ODE.
arXiv Detail & Related papers (2019-12-26T16:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.