Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2601.10962v1
- Date: Fri, 16 Jan 2026 03:03:45 GMT
- Title: Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent
- Authors: Ning Yang, Yikuan Zhang, Qi Ouyang, Chao Tang, Yuhai Tu
- Abstract summary: Stochastic gradient descent (SGD) is central to deep learning, yet the origin of its preference for flatter, more generalizable solutions remains unclear. We identify a nonequilibrium mechanism governing solution selection. We show that the SGD noise reshapes the landscape into an effective potential that favors flat solutions.
- Score: 8.338308750427682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism governing solution selection. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and transition toward flatter regions of the loss landscape. By using a tractable physical model, we show that the SGD noise reshapes the landscape into an effective potential that favors flat solutions. Crucially, we uncover a transient freezing mechanism: as training proceeds, growing energy barriers suppress inter-valley transitions and ultimately trap the dynamics within a single basin. Increasing the SGD noise strength delays this freezing, which enhances convergence to flatter minima. Together, these results provide a unified physical framework linking learning dynamics, loss-landscape geometry, and generalization, and suggest principles for the design of more effective optimization algorithms.
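The escape from a sharp valley described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' model: the piecewise-quadratic loss, the isotropic Gaussian gradient noise, and every name and parameter value (`grad`, `flat_fraction`, `sigma`, `lr`) are assumptions chosen for illustration. Two equally deep minima, one sharp and one flat, are separated by a barrier; all walkers start at the bottom of the sharp valley, and the function reports the fraction that ends in the flat valley.

```python
import numpy as np

def grad(x):
    # Gradient of a hypothetical piecewise-quadratic loss with two equally deep minima:
    #   sharp valley at x = -0.5 (curvature 16), flat valley at x = 2 (curvature 1),
    #   separated by a barrier of height 2 at x = 0.
    return np.where(x < 0.0, 16.0 * (x + 0.5), x - 2.0)

def flat_fraction(sigma, lr=0.05, steps=2000, walkers=2000, seed=0):
    """Fraction of walkers that end in the flat valley (x > 0)."""
    rng = np.random.default_rng(seed)
    x = np.full(walkers, -0.5)  # every walker starts at the sharp minimum
    for _ in range(steps):
        # Assumed noise model: isotropic Gaussian noise added to the gradient.
        noise = sigma * rng.standard_normal(walkers)
        x = x - lr * (grad(x) + noise)
    return float(np.mean(x > 0.0))

if __name__ == "__main__":
    print("no noise:  ", flat_fraction(0.0))
    print("with noise:", flat_fraction(4.5))
```

Without noise the walkers never leave the sharp minimum, because the gradient there is zero. With noise they cross the barrier repeatedly, and most end up in the flat valley because the sharper well is easier to escape. The toy uses constant noise and a fixed landscape, so it does not reproduce the landscape-dependent noise or the transient freezing discussed in the abstract.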
Related papers
- Dynamic Decoupling of Placid Terminal Attractor-based Gradient Descent Algorithm [56.06235614890066]
Gradient descent (GD) and stochastic gradient descent (SGD) have been widely used in a number of application domains.
This paper carefully analyzes the dynamics of GD based on the terminal attractor at different stages of its gradient flow.
arXiv Detail & Related papers (2024-09-10T14:15:56Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
PINNs can suffer training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose employing the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of stochastic gradient descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions [5.022507593837554]
Generalization is one of the most important problems in deep learning (DL).
There exist many low-loss solutions that fit the training data equally well.
The key question is which solution is more generalizable.
arXiv Detail & Related papers (2022-06-02T18:49:36Z)
- The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z)
- Anomalous diffusion dynamics of learning in deep neural networks [0.0]
Learning in deep neural networks (DNNs) is implemented through minimizing a highly non-convex loss function.
We present a novel account of how such effective deep learning emerges through the interactions of SGD and the fractal-like structure of the loss landscape.
arXiv Detail & Related papers (2020-09-22T14:57:59Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on the optimization trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
- How neural networks find generalizable solutions: Self-tuned annealing in deep learning [7.372592187197655]
We find a robust inverse relation between the weight variance and the landscape flatness for all SGD-based learning algorithms.
Our study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape.
arXiv Detail & Related papers (2020-01-06T17:35:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.