Related papers: Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

URL: http://arxiv.org/abs/2412.17613v1
Date: Mon, 23 Dec 2024 14:32:53 GMT
Title: Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities
Authors: Lawrence Wang, Stephen J. Roberts
Abstract summary: We show that instabilities induced by large learning rates move model parameters toward flatter regions of the loss landscape. We find these lead to excellent generalization performance on modern benchmark datasets.
Score: 14.741581246137404
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian - often referred to as the sharpness - is below a critical learning-rate threshold, then training is 'stable' and training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotates. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth > 1, unstable growth in parameters causes rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find these lead to excellent generalization performance on modern benchmark datasets.
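
The stability threshold mentioned in the abstract can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic loss f(x) = 0.5 * sharpness * x**2, for which the classical analysis says gradient descent is stable only when the learning rate is below 2 / sharpness. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
def gd_losses(sharpness, lr, steps=20, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the loss at each step."""
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * x * x)
        x -= lr * sharpness * x  # the gradient of f at x is sharpness * x
    return losses


def is_monotone_decreasing(values):
    """True if every value is strictly smaller than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))


sharpness = 4.0               # hypothetical Hessian eigenvalue of the toy loss
threshold = 2.0 / sharpness   # classical stability threshold on the learning rate

stable = gd_losses(sharpness, lr=0.9 * threshold)    # just below the threshold
unstable = gd_losses(sharpness, lr=1.1 * threshold)  # just above the threshold

print("below threshold, loss decreases monotonically:", is_monotone_decreasing(stable))
print("above threshold, final loss exceeds initial loss:", unstable[-1] > unstable[0])
```

On a pure quadratic the unstable run simply diverges. The abstract's claim concerns deep networks, where the Hessian eigenvectors can rotate during such instabilities, so this toy shows only the threshold itself and not the flattening effect.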

Related papers

Description of the Training Process of Neural Networks via Ergodic Theorem : Ghost nodes [3.637162892228131]
We present a unified framework for understanding and accelerating the training of deep neural networks via stochastic gradient descent (SGD). We introduce a practical diagnostic, the running estimate of the largest Lyapunov exponent, which distinguishes genuine convergence toward stablers. We propose a ghost category extension for standard classifiers that adds auxiliary ghost output nodes so the model gains extra descent directions.
arXiv Detail & Related papers (2025-07-01T17:54:35Z)
Navigating loss manifolds via rigid body dynamics: A promising avenue for robustness and generalisation [11.729464930866483]
Training large neural networks through gradient-based optimization requires navigating high-dimensional loss landscapes. We propose an alternative that simultaneously reduces this dependence, and avoids sharp minima.
arXiv Detail & Related papers (2025-05-26T05:26:21Z)
On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z)
On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics. The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos [6.579523168465526]
In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios.
arXiv Detail & Related papers (2023-11-03T17:59:40Z)
On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation. We show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z)
SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability. We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning. GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients. This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
arXiv Detail & Related papers (2021-02-26T22:08:19Z)
Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics. We show that batch normalization (BN) can stabilize the training, but sometimes results in the false impression of a local minimum. We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z)
The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory. We show that using a large learning rate in the initial phase of training reduces the variance of the gradient. We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.