On the Convergence of Gradient Descent for Large Learning Rates
- URL: http://arxiv.org/abs/2402.13108v2
- Date: Tue, 3 Sep 2024 14:09:08 GMT
- Title: On the Convergence of Gradient Descent for Large Learning Rates
- Authors: Alexandru Crăciun, Debarghya Ghoshdastidar
- Abstract summary: We show that convergence becomes impossible when the fixed step size is too large.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
- Score: 55.33626480243135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible no matter the initialization if the step size gets too big. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see that there is a phase transition as the step size crosses a critical value. This has been observed by practitioners, yet the true mechanisms through which this happens remain unclear beyond heuristics. Using results from dynamical systems theory, we provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient. We validate our findings through experiments with non-linear networks.
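The critical-step-size phase transition described in the abstract can already be seen in the simplest possible setting: a one-dimensional quadratic $f(x) = (L/2)x^2$, which is $L$-smooth, where gradient descent with step size $\eta$ contracts if and only if $\eta < 2/L$. A minimal sketch (the quadratic, the constant $L=4$, and the function name are illustrative choices, not taken from the paper):

```python
# Gradient descent on f(x) = (L/2) * x**2, an L-smooth quadratic.
# The update x <- x - eta * L * x = (1 - eta * L) * x contracts
# iff |1 - eta * L| < 1, i.e. iff eta < 2 / L.

def grad_norm_after(eta, L=4.0, x0=1.0, steps=200):
    """Return |f'(x)| = |L * x| after `steps` gradient descent iterations."""
    x = x0
    for _ in range(steps):
        x = x - eta * L * x
    return abs(L * x)

critical = 2.0 / 4.0  # the critical step size 2/L for L = 4

below = grad_norm_after(eta=0.9 * critical)  # gradient norm decays toward 0
above = grad_norm_after(eta=1.1 * critical)  # gradient norm blows up
print(below, above)
```

Exactly at $\eta = 2/L$ the iterates oscillate between $\pm x_0$ and the gradient norm stays constant, which is the knife-edge of the phase transition observed along the optimization trajectory.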
Related papers
- Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult [49.8719617899285]
Large learning rates, when applied to gradient descent for non-convex optimization, yield various implicit biases, including the edge of stability.
This paper provides an initial step toward understanding them, and shows that these implicit biases are in fact various tips of the same iceberg.
arXiv Detail & Related papers (2023-10-26T01:11:17Z)
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS)
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Understanding the unstable convergence of gradient descent [51.40523554349091]
In machine learning applications, step sizes often violate the classical condition that, for an $L$-smooth cost, the step size be less than $2/L$.
We investigate this unstable convergence phenomenon from first principles, and elucidate key causes behind it.
We also identify its main characteristics, and how they interrelate, offering a transparent view backed by both theory and experiments.
arXiv Detail & Related papers (2022-04-03T11:10:17Z)
- On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z)
- Convergence of gradient descent for learning linear neural networks [2.209921757303168]
We show that gradient descent converges to a critical point of the loss function, i.e., the square loss in this article.
In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank.
arXiv Detail & Related papers (2021-08-04T13:10:30Z)
- Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent converges efficiently to a global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z)
- Learning Quantized Neural Nets by Coarse Gradient Method for Non-linear Classification [3.158346511479111]
We propose a class of STEs with certain monotonicity, and consider their applications to the training of a two-linear-layer network with quantized activation functions.
We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum.
arXiv Detail & Related papers (2020-11-23T07:50:09Z)
- Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.