On the Convergence of Gradient Descent for Large Learning Rates
- URL: http://arxiv.org/abs/2402.13108v2
- Date: Tue, 3 Sep 2024 14:09:08 GMT
- Title: On the Convergence of Gradient Descent for Large Learning Rates
- Authors: Alexandru Crăciun, Debarghya Ghoshdastidar
- Abstract summary: We show that convergence becomes impossible when the fixed step size is too large.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
- Score: 55.33626480243135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible no matter the initialization if the step size gets too big. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see that there is a phase transition as the step size crosses a critical value. This has been observed by practitioners, yet the true mechanisms through which this happens remain unclear beyond heuristics. Using results from dynamical systems theory, we provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient. We validate our findings through experiments with non-linear networks.
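The critical-step-size phase transition described in the abstract can already be seen in the simplest possible setting: a one-dimensional quadratic $f(x) = (L/2)x^2$, which is $L$-smooth, where gradient descent with step size $\eta$ contracts if and only if $\eta < 2/L$. A minimal sketch (the quadratic, the constant $L=4$, and the function name are illustrative choices, not taken from the paper):

```python
# Gradient descent on f(x) = (L/2) * x**2, an L-smooth quadratic.
# The update x <- x - eta * L * x = (1 - eta * L) * x contracts
# iff |1 - eta * L| < 1, i.e. iff eta < 2 / L.

def grad_norm_after(eta, L=4.0, x0=1.0, steps=200):
    """Return |f'(x)| = |L * x| after `steps` gradient descent iterations."""
    x = x0
    for _ in range(steps):
        x = x - eta * L * x
    return abs(L * x)

critical = 2.0 / 4.0  # the critical step size 2/L for L = 4

below = grad_norm_after(eta=0.9 * critical)  # gradient norm decays toward 0
above = grad_norm_after(eta=1.1 * critical)  # gradient norm blows up
print(below, above)
```

Exactly at $\eta = 2/L$ the iterates oscillate between $\pm x_0$ and the gradient norm stays constant, which is the knife-edge of the phase transition observed along the optimization trajectory.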
Related papers
- Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult [49.8719617899285]
Large learning rates, when applied to gradient descent for non-convex optimization, yield various implicit biases, including the edge of stability.
This paper provides an initial step toward understanding them, and shows that these implicit biases are in fact various tips of the same iceberg.
arXiv Detail & Related papers (2023-10-26T01:11:17Z)
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS)
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Understanding the unstable convergence of gradient descent [51.40523554349091]
In machine learning applications, step sizes often violate the classical condition that, for an $L$-smooth cost, the step size be less than $2/L$.
We investigate this unstable convergence phenomenon from first principles, and elucidate key causes behind it.
We also identify its main characteristics, and how they interrelate, offering a transparent view backed by both theory and experiments.
arXiv Detail & Related papers (2022-04-03T11:10:17Z)
- On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z)
- Convergence of gradient descent for learning linear neural networks [2.209921757303168]
We show that gradient descent converges to a critical point of the loss function, i.e., the square loss in this article.
In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank.
arXiv Detail & Related papers (2021-08-04T13:10:30Z)
- Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent converges efficiently to a global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z)
- Learning Quantized Neural Nets by Coarse Gradient Method for Non-linear Classification [3.158346511479111]
We propose a class of STEs with certain monotonicity, and consider their applications to the training of a two-linear-layer network with quantized activation functions.
We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum.
arXiv Detail & Related papers (2020-11-23T07:50:09Z)
- Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.