Related papers: Beyond the Edge of Stability via Two-step Gradient Updates

Beyond the Edge of Stability via Two-step Gradient Updates

URL: http://arxiv.org/abs/2206.04172v3
Date: Wed, 26 Jul 2023 10:48:54 GMT
Title: Beyond the Edge of Stability via Two-step Gradient Updates
Authors: Lei Chen, Joan Bruna
Abstract summary: Gradient Descent (GD) is a powerful workhorse of modern machine learning. GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients. This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
Score: 49.03389279816152
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a `bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called ``Edge of Stability'' (EoS), where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behavior. The incipient theoretical analysis of this phenomena has mainly focused in the overparametrised regime, where the effect of choosing a large learning rate may be associated to a `Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems, via analysis of two-step gradient updates. Specifically, we characterize a local condition involving third-order derivatives that guarantees existence and convergence to fixed points of the two-step updates, and leverage such property in a teacher-student setting, under population loss. Finally, starting from Matrix Factorization, we provide observations of period-2 orbit of GD in high-dimensional settings with intuition of its dynamics, along with exploration into more general settings.

Related papers

Description of the Training Process of Neural Networks via Ergodic Theorem : Ghost nodes [3.637162892228131]
We present a unified framework for understanding and accelerating deep neural networks via training gradient descent (SGD)<n>We introduce a practical diagnostic, the running estimate of the largest Lyapunov exponent, which distinguishes genuine convergence toward stablers.<n>We propose a ghost category extension for standard classifiers that adds auxiliary ghost output nodes so the model gains extra descent directions.
arXiv Detail & Related papers (2025-07-01T17:54:35Z)
Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions [16.45408984254899]
Implicit regularization refers to the phenomenon where local search algorithms converge to low-dimensional solutions.<n>We show that successful implicit regularization hinges on the ability to efficiently escape strict saddle gradient, while maintaining proximity to the implicit region.
arXiv Detail & Related papers (2025-05-22T21:45:27Z)
First-ish Order Methods: Hessian-aware Scalings of Gradient Descent [11.125968799758436]
A key limitation of gradient descent is its lack of natural scaling. By accounting for curvature, our adaptive, Hessian-aware scaling method ensures a local unit step size. We show that our method converges globally under a significantly weaker version of the standard Lipschitz assumption.
arXiv Detail & Related papers (2025-02-06T01:22:23Z)
On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z)
Provably Accelerating Ill-Conditioned Low-rank Estimation via Scaled Gradient Descent, Even with Overparameterization [48.65416821017865]
This chapter introduces a new algorithmic approach, dubbed scaled gradient (ScaledGD) It converges linearly at a constant rate independent of the condition number of the low-rank object. It maintains the low periteration cost of gradient descent for a variety of tasks.
arXiv Detail & Related papers (2023-10-09T21:16:57Z)
Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift. Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures. We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We show restrained numerical instabilities in current training practices of deep networks with gradient descent (SGD) We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs) We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization resulting in a stabilizing effect.
arXiv Detail & Related papers (2022-06-04T14:54:05Z)
A Local Convergence Theory for the Stochastic Gradient Descent Method in Non-Convex Optimization With Non-isolated Local Minima [0.0]
Non-isolated minima presents a unique challenge that has remained under-explored. In this paper, we study the local convergence of the gradient descent method to non-isolated global minima.
arXiv Detail & Related papers (2022-03-21T13:33:37Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates. This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting. To our best knowledge, this gives the firstever-known stability and generalization for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z)
Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem [27.09339991866556]
We show that ODE searches for optimal control for an unknown computation system by directly searching over the corresponding space of controllers. We take a step towards demystifying the performance and efficiency of such methods by focusing on the gradient-flow dynamics set of stabilizing feedback gains and a similar result holds for the forward disctization of the ODE.
arXiv Detail & Related papers (2019-12-26T16:56:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.