Training Instabilities Induce Flatness Bias in Gradient Descent
- URL: http://arxiv.org/abs/2511.12558v1
- Date: Sun, 16 Nov 2025 11:26:25 GMT
- Title: Training Instabilities Induce Flatness Bias in Gradient Descent
- Authors: Lawrence Wang, Stephen J. Roberts
- Abstract summary: Modern deep networks often achieve their best performance beyond a stability threshold. We show that training instabilities induce an implicit bias in GD, driving parameters toward flatter regions of the loss landscape. We also show that restoring instabilities in Adam further improves generalization.
- Score: 6.628332915214955
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Classical analyses of gradient descent (GD) define a stability threshold based on the largest eigenvalue of the loss Hessian, often termed sharpness. When the learning rate lies below this threshold, training is stable and the loss decreases monotonically. Yet, modern deep networks often achieve their best performance beyond this regime. We demonstrate that such instabilities induce an implicit bias in GD, driving parameters toward flatter regions of the loss landscape and thereby improving generalization. The key mechanism is the Rotational Polarity of Eigenvectors (RPE), a geometric phenomenon in which the leading eigenvectors of the Hessian rotate during training instabilities. These rotations, which increase with the learning rate, promote exploration and provably lead to flatter minima. This theoretical framework extends to stochastic GD, where instability-driven flattening persists and its empirical effects outweigh minibatch noise. Finally, we show that restoring instabilities in Adam further improves generalization. Together, these results establish and explain the constructive role of training instabilities in deep learning.
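For intuition about the threshold itself, consider the one-dimensional quadratic L(θ) = ½·sharpness·θ², where the Hessian eigenvalue is constant and the classical condition is exactly lr < 2/sharpness. The sketch below illustrates that classical fact only; it is not code from the paper:

```python
import numpy as np

def gd_trajectory(sharpness, lr, theta0=1.0, steps=50):
    """Run gradient descent on the quadratic L(theta) = 0.5 * sharpness * theta**2.

    The update theta <- theta - lr * sharpness * theta contracts
    iff |1 - lr * sharpness| < 1, i.e. iff lr < 2 / sharpness.
    """
    theta = theta0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * theta**2)
        theta -= lr * sharpness * theta  # gradient of the quadratic is sharpness * theta
    return np.array(losses)

sharpness = 10.0
stable = gd_trajectory(sharpness, lr=0.19)    # below 2/10 = 0.2: loss decreases monotonically
unstable = gd_trajectory(sharpness, lr=0.21)  # above 2/10: loss oscillates and grows
print(stable[-1], unstable[-1])
```

On a real deep-learning loss the sharpness varies along the trajectory, which is where the eigenvector-rotation (RPE) mechanism described in the abstract comes into play.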
Related papers
- Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning [10.054396813990481]
Class-incremental learning (CIL) faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. We propose the Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in the cross-entropy loss.
arXiv Detail & Related papers (2026-03-02T01:57:52Z)
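As a rough illustration of reweighting negative supervision in cross-entropy: the summary does not specify TAL's kernel or weighting scheme, so the exponential decay and the weighted-softmax form below are assumptions, not the paper's method:

```python
import numpy as np

def temporal_adjusted_ce(logits, target, task_of_class, current_task, gamma=0.5):
    """Hypothetical TAL-style cross-entropy (sketch; the paper's exact form may differ).

    Non-target ("negative") classes enter the softmax denominator with a
    weight that decays with how long ago the class was introduced, so stale
    negative supervision is softened.
    """
    age = current_task - np.asarray(task_of_class, dtype=float)
    strength = np.exp(-gamma * age)        # assumed temporal decay kernel
    strength[target] = 1.0                 # positive supervision is left untouched

    z = logits - logits.max()              # shift logits for numerical stability
    weighted_exp = strength * np.exp(z)
    return -np.log(np.exp(z[target]) / weighted_exp.sum())

logits = np.array([2.0, 0.5, -1.0])
# Class 0 is from the current task; classes 1 and 2 are from older tasks.
print(temporal_adjusted_ce(logits, target=1, task_of_class=[2, 1, 0], current_task=2))
```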
- Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization [12.58055746943097]
We argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these forms of implicit regularization, here norm-based and sharpness-based regularization. We prove, for diagonal linear networks trained on a simple regression task, that neither implicit bias alone minimizes the generalization error.
arXiv Detail & Related papers (2025-05-27T16:51:06Z)
- Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities [14.741581246137404]
We show that instabilities induced by large learning rates move model parameters toward flatter regions of the loss landscape. We find that these instabilities lead to excellent generalization performance on modern benchmark datasets.
arXiv Detail & Related papers (2024-12-23T14:32:53Z)
- Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent [0.6906005491572401]
In deep neural networks, stochastic gradient descent (SGD) with momentum is said to converge faster and generalize better than SGD without momentum. In particular, adding momentum is thought to reduce minibatch noise. We analyze the effect of search direction noise, defined as the error between the search direction and the steepest descent direction.
arXiv Detail & Related papers (2024-02-04T02:48:28Z)
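The quantity called search direction noise can be made concrete with a small simulation. This is a sketch under assumed definitions, not the paper's code; in particular the (1 - beta) scaling used to compare the momentum buffer with the full gradient is our normalization choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Least-squares problem: the full-batch gradient is A.T @ (A w - b) / n.
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
w = np.zeros(d)

beta, lr, batch = 0.9, 0.01, 32
m = np.zeros(d)

for step in range(200):
    idx = rng.choice(n, size=batch, replace=False)
    g_batch = A[idx].T @ (A[idx] @ w - b[idx]) / batch   # minibatch gradient
    m = beta * m + g_batch                               # momentum search direction
    g_full = A.T @ (A @ w - b) / n                       # steepest descent direction
    # "Search direction noise": error between the (rescaled) search
    # direction and the full-batch steepest descent direction.
    noise = (1 - beta) * m - g_full
    w -= lr * m

print(np.linalg.norm(noise))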
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities for analyzing closed-form dynamics. The unhinged loss also accommodates more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
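For reference, the unhinged loss in its original binary form (due to van Rooyen et al.; the multiclass variant analyzed in this paper may differ) is linear in the margin:

```latex
% Binary unhinged loss on a prediction v = f(x) with label y \in \{-1, +1\}:
\[
\ell_{\mathrm{unhinged}}(v, y) \;=\; 1 - y\,v ,
\qquad
\frac{\partial \ell}{\partial v} \;=\; -y .
\]
```

Because the gradient is constant in v, the induced training dynamics are linear in the model outputs, which is what makes closed-form analysis tractable.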
- Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z)
- On a continuous time model of gradient descent dynamics and instability in deep learning [12.20253214080485]
We propose the principal flow (PF) as a continuous-time flow that approximates gradient descent dynamics.
The PF sheds light on the recently observed edge-of-stability phenomenon in deep learning.
Using our new understanding of instability, we propose a learning rate adaptation method that lets us control the trade-off between training stability and test performance.
arXiv Detail & Related papers (2023-02-03T19:03:10Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
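A textbook decomposition (not the paper's specific bound) shows where algorithmic stability enters such excess risk bounds: with R the population risk, R_S the empirical risk on sample S, w_T the trained parameters, and w* the population minimizer,

```latex
\[
\mathbb{E}\!\left[ R(w_T) - R(w^*) \right]
  = \underbrace{\mathbb{E}\!\left[ R(w_T) - R_S(w_T) \right]}_{\text{generalization gap, bounded via stability}}
  + \underbrace{\mathbb{E}\!\left[ R_S(w_T) - R_S(w^*) \right]}_{\text{optimization error}}
  + \underbrace{\mathbb{E}\!\left[ R_S(w^*) - R(w^*) \right]}_{=\,0 \text{ in expectation}}
\]
```

The last term vanishes because w* is independent of the sample S, so the analysis reduces to bounding stability and optimization error jointly.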
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
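To see why two-step (rather than one-step) analysis is natural beyond the edge of stability, consider the illustrative quartic L(θ) = θ⁴/4 (our example, not the paper's):

```latex
\[
\theta_{t+1} = \theta_t - \eta\,\theta_t^{3}
\qquad\text{(GD on } L(\theta)=\tfrac{1}{4}\theta^{4}\text{)}
\]
\[
\theta - \eta\,\theta^{3} = -\theta
\;\Longleftrightarrow\;
\eta\,\theta^{2} = 2,
\qquad
\theta^{*} = \pm\sqrt{2/\eta}.
\]
```

The points ±√(2/η) form a period-2 orbit of GD: they are fixed points of the two-step map but not of the one-step map, so a one-step fixed-point analysis cannot capture them (whether such an orbit attracts depends on the loss).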
- Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)