Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2303.13093v4
- Date: Tue, 2 Jul 2024 13:05:59 GMT
- Title: Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent
- Authors: Liu Ziyin, Botao Li, Tomer Galanti, Masahito Ueda
- Abstract summary: We show that saddle points in neural networks can be divided into two types, among which the Type-II saddles are especially difficult to escape from because the gradient noise vanishes at the saddle.
The dynamics of SGD around these saddles are thus to leading order described by a random matrix product process.
We leverage well-known concepts in ergodic theory to show that saddle points can be either attractive or repulsive for SGD, and that its dynamics can be classified into four different phases, depending on the signal-to-noise ratio in the gradient close to the saddle.
- Score: 16.849376037005452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem. We first show that saddle points in neural networks can be divided into two types, among which the Type-II saddles are especially difficult to escape from because the gradient noise vanishes at the saddle. The dynamics of SGD around these saddles are thus to leading order described by a random matrix product process, and it is thus natural to study the dynamics of SGD around these saddles using the notion of probabilistic stability and the related Lyapunov exponent. Theoretically, we link the study of SGD dynamics to well-known concepts in ergodic theory, which we leverage to show that saddle points can be either attractive or repulsive for SGD, and its dynamics can be classified into four different phases, depending on the signal-to-noise ratio in the gradient close to the saddle.
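As an illustration of the stability criterion described in the abstract, the following is a minimal sketch (not the authors' code; the Gaussian curvature model and all parameter values are assumptions made here for illustration). It estimates the Lyapunov exponent of the one-dimensional random matrix product that governs linearized SGD along a single eigendirection of a Type-II saddle, and reads off whether the saddle is probabilistically attractive or repulsive at each learning rate.

```python
import numpy as np

# Minimal 1-D sketch of probabilistic stability near a Type-II saddle: along one
# eigendirection the linearized SGD update is
#     x_{t+1} = (1 - lr * h_t) * x_t,
# with h_t the per-minibatch curvature, so x_t follows a random matrix (here
# scalar) product. The saddle is probabilistically attractive when the Lyapunov
# exponent  Lambda = E[log|1 - lr * h|]  is negative, and repulsive when it is
# positive.

rng = np.random.default_rng(0)

def lyapunov_exponent(lr, curv_mean, curv_std, n_samples=500_000):
    """Monte Carlo estimate of E[log|1 - lr*h|] for h ~ N(curv_mean, curv_std^2)."""
    h = rng.normal(curv_mean, curv_std, size=n_samples)
    return np.log(np.abs(1.0 - lr * h)).mean()

# Illustrative sweep: fixed curvature signal-to-noise ratio, varying learning rate.
curv_mean, curv_std = -1.0, 5.0   # escape direction (negative mean curvature), noisy
for lr in [0.05, 0.2, 0.5, 1.0]:
    lam = lyapunov_exponent(lr, curv_mean, curv_std)
    phase = "attractive (SGD trapped)" if lam < 0 else "repulsive (SGD escapes)"
    print(f"lr={lr:4.2f}  Lambda={lam:+.3f}  {phase}")
```

The sign of the estimated exponent is what classifies the phase; sweeping the learning rate or the curvature signal-to-noise ratio traces out phase boundaries of this kind.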
Related papers
- Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation [20.44428092865608]
MCMC methods enable scalable posterior sampling but often suffer from sensitivity to minibatch size and gradient noise.
We propose the Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization.
arXiv Detail & Related papers (2026-02-17T18:09:49Z) - Stochastic Weakly Convex Optimization Under Heavy-Tailed Noises [55.43924214633558]
In this paper, we focus on two types of noise: one is sub-Weibull noise, and the other is SsBC noise.
Under these two noise assumptions, the in-expectation and high-probability convergence of stochastic first-order methods (SFOMs) has been studied in the contexts of convex optimization and smooth optimization.
arXiv Detail & Related papers (2025-07-17T16:48:45Z) - Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks [0.6906005491572401]
We show that the noise in stochastic gradient descent (SGD) with momentum smooths the objective function, with the degree of smoothing determined by the learning rate, the batch size, the momentum factor, and the upper bound of the gradient norm.
We also provide experimental results supporting our assertion that model generalizability depends on the noise level.
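For concreteness, here is a minimal sketch of the heavy-ball (momentum) SGD update the summary refers to; the learning rate, batch size, and momentum factor are the hyperparameters the paper ties to the degree of smoothing, while the least-squares loss and synthetic data are placeholders assumed here, not the paper's setup.

```python
import numpy as np

# Minimal sketch of heavy-ball (momentum) SGD; the learning rate, batch size,
# and momentum factor are the quantities the summary mentions. Loss and data
# are illustrative placeholders.

rng = np.random.default_rng(0)

def minibatch_grad(w, X, y, batch_size):
    """Gradient of the minibatch least-squares loss 0.5*||X_b w - y_b||^2 / b."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

def sgd_momentum(w, X, y, lr=0.01, momentum=0.9, batch_size=32, steps=1000):
    v = np.zeros_like(w)
    for _ in range(steps):
        g = minibatch_grad(w, X, y, batch_size)   # noisy gradient estimate
        v = momentum * v - lr * g                 # heavy-ball velocity update
        w = w + v
    return w

# Usage on synthetic data.
X = rng.normal(size=(512, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=512)
w_hat = sgd_momentum(np.zeros(10), X, y)
print(np.linalg.norm(w_hat - w_true))
```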
arXiv Detail & Related papers (2024-02-04T02:48:28Z) - Doubly Stochastic Models: Learning with Unbiased Label Noises and
Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noise under mini-batch sampling in stochastic gradient descent (SGD).
We find that such an implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not assume that SGD behaves like an Ornstein-Uhlenbeck process, and it achieves a more general result with the convergence of the approximation proved.
arXiv Detail & Related papers (2023-04-01T14:09:07Z) - Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the
Bounded Gradient Assumption [11.367487348673793]
We prove that various stochastic gradient methods, including stochastic gradient descent (SGD), the stochastic heavy-ball (SHB) method, and the stochastic Nesterov accelerated gradient (SNAG) method, almost surely avoid any strict saddle manifold.
This is the first time such results are obtained for SHB and SNAG methods.
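For reference, a short sketch of the three stochastic update rules the result covers; the noisy-gradient oracle and the strict-saddle objective below are illustrative stand-ins assumed here, not the paper's setting.

```python
import numpy as np

# Sketch of the three stochastic methods (SGD, SHB, SNAG). The noisy-gradient
# oracle and the strict-saddle objective are illustrative stand-ins.

rng = np.random.default_rng(0)

def noisy_grad(w):
    # gradient of the strict saddle f(w) = 0.5 * (w[0]**2 - w[1]**2), plus noise
    return np.array([w[0], -w[1]]) + 0.1 * rng.normal(size=2)

def sgd_step(w, lr):
    return w - lr * noisy_grad(w)

def shb_step(w, v, lr, beta):
    # stochastic heavy-ball (SHB): the velocity accumulates past stochastic gradients
    v = beta * v - lr * noisy_grad(w)
    return w + v, v

def snag_step(w, w_prev, lr, beta):
    # stochastic Nesterov accelerated gradient (SNAG): gradient at a look-ahead point
    y = w + beta * (w - w_prev)
    return y - lr * noisy_grad(y), w.copy()
```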
arXiv Detail & Related papers (2023-02-15T18:53:41Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of stochastic gradient descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - Slow semiclassical dynamics of a two-dimensional Hubbard model in
disorder-free potentials [77.34726150561087]
We show that the introduction of harmonic and spin-dependent linear potentials sufficiently validates fTWA for longer times.
In particular, we focus on a finite two-dimensional system and show that at intermediate linear potential strength, the addition of a harmonic potential and of a spin dependence of the tilt results in subdiffusive dynamics.
arXiv Detail & Related papers (2022-10-03T16:51:25Z) - A constrained gentlest ascent dynamics and its applications to finding
excited states of Bose-Einstein condensates [6.6772808699409705]
We show that the linearly stable steady state of the proposed CGAD is exactly a nondegenerate constrained saddle point with a corresponding Morse index.
The CGAD is then applied to find excited states of single-component Bose-Einstein condensates (BECs) in the order of their Morse indices.
arXiv Detail & Related papers (2022-09-10T15:09:07Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
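As background for the "edge of stability" terminology, here is a minimal numeric illustration of the classical step-size threshold on a quadratic loss; the paper's two-step analysis concerns what can happen beyond this threshold for non-quadratic losses.

```python
# On a quadratic loss f(x) = L*x^2/2 the GD map is x <- (1 - lr*L) * x, so the
# iterates converge iff lr < 2/L and oscillate and diverge beyond that
# threshold. This is the classical "edge" the entry's two-step analysis goes
# beyond for non-quadratic losses.

L = 1.0

def gd(lr, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x -= lr * L * x          # gradient of L*x^2/2 is L*x
    return x

for lr in [1.5 / L, 2.5 / L]:    # below and above the 2/L threshold
    print(f"lr*L = {lr * L:.1f}: x_50 = {gd(lr):+.3e}")
```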
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - Phase diagram of Rydberg-dressed atoms on two-leg square ladders:
Coupling supersymmetric conformal field theories on the lattice [52.77024349608834]
We investigate the phase diagram of hard-core bosons in two-leg ladders in the presence of soft-shoulder potentials.
We show how the competition between local and non-local terms gives rise to a phase diagram with liquid phases with dominant cluster, spin, and density-wave quasi-long-range ordering.
arXiv Detail & Related papers (2021-12-20T09:46:08Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
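A minimal sketch of the setting (illustrative assumptions throughout: ReLU random features, synthetic Gaussian data, and hand-picked hyperparameters, none taken from the paper): random-features regression trained with constant step-size SGD, swept across the number of features around the interpolation threshold, which is the regime where double descent of the test error is studied.

```python
import numpy as np

# Illustrative random-features (RF) regression trained with plain SGD; the
# feature count is swept around the interpolation threshold (n_features ~ n_train).

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 100, 1000
w_star = rng.normal(size=d) / np.sqrt(d)

def make_data(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_star + 0.1 * rng.normal(size=n)

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def rf_features(X, W):
    return np.maximum(X @ W, 0.0) / np.sqrt(W.shape[1])   # ReLU random features

def train_sgd(Phi, y, lr=0.05, epochs=200, batch=10):
    theta = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(len(y)), len(y) // batch):
            residual = Phi[idx] @ theta - y[idx]
            theta -= lr * Phi[idx].T @ residual / len(idx)   # constant step size
    return theta

for n_features in [20, 80, 100, 200, 800]:
    W = rng.normal(size=(d, n_features))
    theta = train_sgd(rf_features(X_tr, W), y_tr)
    test_mse = np.mean((rf_features(X_te, W) @ theta - y_te) ** 2)
    print(f"features={n_features:4d}  test MSE={test_mse:.3f}")
```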
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Escaping Saddle Points with Stochastically Controlled Stochastic
Gradient Methods [12.173568611144626]
We show that a first-order stochastically controlled stochastic gradient (SCSG) method can escape saddle points when perturbed by noise or by a separately proposed step.
The proposed step is designed to be incorporated into the CNC-SCSGD method for escaping saddle points.
arXiv Detail & Related papers (2021-03-07T18:09:43Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first-ever-known stability and generalization bounds for SGD even with non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z)