Understanding Gradient Descent on Edge of Stability in Deep Learning
- URL: http://arxiv.org/abs/2205.09745v1
- Date: Thu, 19 May 2022 17:57:01 GMT
- Title: Understanding Gradient Descent on Edge of Stability in Deep Learning
- Authors: Sanjeev Arora, Zhiyuan Li, Abhishek Panigrahi
- Abstract summary: This paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss.
The above theoretical results have been corroborated by an experimental study.
- Score: 32.03036040349019
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning experiments in Cohen et al. (2021) using deterministic Gradient
Descent (GD) revealed an {\em Edge of Stability (EoS)} phase when learning rate
(LR) and sharpness (\emph{i.e.}, the largest eigenvalue of Hessian) no longer
behave as in traditional optimization. Sharpness stabilizes around $2/$LR and
loss goes up and down across iterations, yet still with an overall downward
trend. The current paper mathematically analyzes a new mechanism of implicit
regularization in the EoS phase, whereby GD updates due to non-smooth loss
landscape turn out to evolve along some deterministic flow on the manifold of
minimum loss. This is in contrast to many previous results about implicit bias
either relying on infinitesimal updates or noise in gradient. Formally, for any
smooth function $L$ with certain regularity condition, this effect is
demonstrated for (1) {\em Normalized GD}, i.e., GD with a varying LR $ \eta_t
=\frac{ \eta }{ || \nabla L(x(t)) || } $ and loss $L$; (2) GD with constant LR
and loss $\sqrt{L}$. Both provably enter the Edge of Stability, with the
associated flow on the manifold minimizing $\lambda_{\max}(\nabla^2 L)$. The
above theoretical results have been corroborated by an experimental study.
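As a concrete illustration of cases (1) and (2), here is a minimal numerical sketch. It is not the paper's code: the toy loss $L(x,y)=(xy-1)^2$, the learning rate, and the step counts are illustrative assumptions. Per the paper's result, both runs should oscillate around the zero-loss manifold $\{xy=1\}$ and slowly drift toward its minimum-sharpness point $x=y=1$, where $\lambda_{\max}(\nabla^2 L)=4$; the printed sharpness at the final iterate tracks that drift.

```python
# Minimal sketch of cases (1) and (2) from the abstract -- not the paper's code.
# Toy loss L(x, y) = (x*y - 1)^2; its zero-loss set {xy = 1} is a 1-D manifold
# whose sharpness lambda_max(Hessian) varies along it and is minimized at x = y = 1.
import numpy as np

def loss(p):
    x, y = p
    return (x * y - 1.0) ** 2

def grad(p):
    x, y = p
    r = x * y - 1.0
    return np.array([2.0 * r * y, 2.0 * r * x])

def sharpness(p):
    # largest eigenvalue of the Hessian of L at p
    x, y = p
    H = np.array([[2.0 * y * y, 4.0 * x * y - 2.0],
                  [4.0 * x * y - 2.0, 2.0 * x * x]])
    return np.linalg.eigvalsh(H)[-1]

def normalized_gd(p, eta=0.05, steps=2000):
    # Case (1): GD on L with varying LR eta_t = eta / ||grad L(x_t)||
    for _ in range(steps):
        g = grad(p)
        n = np.linalg.norm(g)
        if n < 1e-12:
            break
        p = p - (eta / n) * g
    return p

def sqrt_loss_gd(p, eta=0.05, steps=2000):
    # Case (2): GD with constant LR on sqrt(L); grad sqrt(L) = grad L / (2 sqrt(L))
    for _ in range(steps):
        l = loss(p)
        if l < 1e-24:
            break
        p = p - eta * grad(p) / (2.0 * np.sqrt(l))
    return p

p0 = np.array([3.0, 0.2])
print("initial sharpness:", round(sharpness(p0), 3))
for name, run in [("Normalized GD", normalized_gd), ("GD on sqrt(L)", sqrt_loss_gd)]:
    p = run(p0.copy())
    print(f"{name:14s} final loss {loss(p):.2e}, final sharpness {sharpness(p):.3f}")
```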
Related papers
- Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes (a generic sketch of both stepsize rules appears after this list).
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits its initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
arXiv Detail & Related papers (2024-02-24T23:10:28Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex
Optimization [9.019243171993553]
Training steps $T$ and step-size $\eta$ might affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show that better rates can be attained, but the improvement diminishes when the training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z) - Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experiments suggest that such training exhibits a fairly special behavior in that it mitigates the order of complexity.
arXiv Detail & Related papers (2023-03-02T00:57:38Z) - Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior [8.76346911214414]
We show a new high probability generalization bound of order $O\left(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^{T}(\gamma_t/\varepsilon_t)^2\right)$ for Gradient Langevin Dynamics (GLD).
We can also obtain new bounds for certain variants of SGD.
arXiv Detail & Related papers (2022-05-27T07:23:01Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Black-Box Generalization [31.80268332522017]
We provide the first generalization error analysis for black-box learning through derivative-free optimization.
We show that both generalization bounds are independent of $d$ and $K$ under appropriate choices of a slightly decreased learning rate.
arXiv Detail & Related papers (2022-02-14T17:14:48Z) - What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
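For the clipping and Polyak-stepsize methods referenced above (Methods for Convex $(L_0,L_1)$-Smooth Optimization), here is a minimal sketch on an ill-conditioned convex quadratic. The quadratic, the threshold $c$, the learning rate, and the step counts are illustrative assumptions, and the clipping shown is the standard form, not the cited paper's smoothed variant.

```python
# Minimal sketch (not the cited paper's method): plain clipped GD and GD with the
# Polyak stepsize on an ill-conditioned convex quadratic f(x) = 0.5 * x^T A x.
import numpy as np

A = np.diag([1.0, 10.0, 100.0])          # eigenvalues of the Hessian
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
f_star = 0.0                             # known optimal value, needed by the Polyak stepsize

def clipped_gd(x, eta=0.015, c=10.0, steps=300):
    # x_{t+1} = x_t - eta * min(1, c / ||g_t||) * g_t
    for _ in range(steps):
        g = grad(x)
        x = x - eta * min(1.0, c / (np.linalg.norm(g) + 1e-12)) * g
    return x

def polyak_gd(x, steps=300):
    # x_{t+1} = x_t - ((f(x_t) - f_star) / ||g_t||^2) * g_t
    for _ in range(steps):
        g = grad(x)
        x = x - ((f(x) - f_star) / (g @ g + 1e-12)) * g
    return x

x0 = np.array([1.0, 1.0, 1.0])
print("clipped GD final value:", f(clipped_gd(x0.copy())))
print("Polyak GD final value :", f(polyak_gd(x0.copy())))
```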
This list is automatically generated from the titles and abstracts of the papers in this site.