A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2310.00692v3
- Date: Thu, 1 Feb 2024 11:15:37 GMT
- Title: A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent
- Authors: Mingze Wang, Lei Wu
- Abstract summary: We provide a theoretical study of noise geometry for minibatch stochastic gradient descent (SGD), a phenomenon where noise aligns favorably with the geometry of the local landscape.
We propose two metrics, derived from analyzing how noise influences the loss and subspace projection dynamics, to quantify the alignment strength.
- Score: 9.064667124987068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we provide a theoretical study of noise geometry for minibatch
stochastic gradient descent (SGD), a phenomenon where noise aligns favorably
with the geometry of the local landscape. We propose two metrics, derived from
analyzing how noise influences the loss and subspace projection dynamics, to
quantify the alignment strength. We show that for (over-parameterized) linear
models and two-layer nonlinear networks, when measured by these metrics, the
alignment can be provably guaranteed under conditions independent of the degree
of over-parameterization. To showcase the utility of our noise geometry
characterizations, we present a refined analysis of the mechanism by which SGD
escapes from sharp minima. We reveal that unlike gradient descent (GD), which
escapes along the sharpest directions, SGD tends to escape from flatter
directions and cyclical learning rates can exploit this SGD characteristic to
navigate more effectively towards flatter regions. Lastly, extensive
experiments are provided to support our theoretical findings.
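To make the alignment idea concrete, here is a minimal, hypothetical Python sketch (not the authors' code; the trace-ratio metric below is an illustrative stand-in for the paper's loss- and projection-based metrics). It estimates how much minibatch-SGD noise energy falls along the sharp directions of an over-parameterized linear regression landscape.

    # Illustrative sketch (not the paper's code): estimate how strongly minibatch
    # SGD noise aligns with the local curvature for over-parameterized linear
    # regression, loss L(w) = (1/2n) * ||X w - y||^2.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, batch = 64, 256, 8          # over-parameterized: d > n
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star
    w = rng.normal(size=d)            # current iterate

    residual = X @ w - y
    per_example_grads = residual[:, None] * X          # g_i = (x_i^T w - y_i) x_i
    full_grad = per_example_grads.mean(axis=0)
    # Minibatch noise covariance (sampling with replacement):
    # Sigma = (1/batch) * ( E[g_i g_i^T] - g g^T )
    second_moment = per_example_grads.T @ per_example_grads / n
    Sigma = (second_moment - np.outer(full_grad, full_grad)) / batch
    H = X.T @ X / n                                    # Hessian of the quadratic loss

    # A simple alignment ratio: noise energy seen through the curvature,
    # normalized by what isotropic noise of the same total variance would give.
    alignment = np.trace(Sigma @ H) / (np.trace(Sigma) * np.trace(H) / d)
    print(f"alignment ratio (values well above 1 mean the noise concentrates on sharp directions): {alignment:.2f}")

For such a toy problem the ratio typically comes out well above 1, consistent with the claim that SGD noise aligns with the local curvature rather than spreading isotropically; the paper's actual metrics are derived from the loss and subspace-projection dynamics rather than this trace ratio.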
Related papers
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of gradient descent.
We find such implicit regularizer would favor some convergence points that could stabilize model outputs against perturbation of parameters.
Our work does not assume SGD to be an Ornstein-Uhlenbeck-like process and achieves a more general result, with convergence of the approximation proven.
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
- Per-Example Gradient Regularization Improves Learning Signals from Noisy Data [25.646054298195434]
Empirical evidence suggests that per-example gradient regularization (PEGR) can significantly enhance the robustness of deep learning models against noisy perturbations.
We present a theoretical analysis that demonstrates its effectiveness in improving both test error and robustness against noise perturbations.
Our analysis reveals that PEGR penalizes the variance of pattern learning, thus effectively suppressing the memorization of noise from the training data.
arXiv Detail & Related papers (2023-03-31T10:08:23Z)
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
- Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima [10.990447273771592]
We develop a quantitative theory of the escape problem for the stochastic gradient descent (SGD) algorithm.
We investigate the effect of the sharpness of loss surfaces on SGD's escape from local minima in neural networks.
arXiv Detail & Related papers (2021-11-07T05:00:35Z)
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance (a numerical sketch of this relation appears after this list).
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
- GELATO: Geometrically Enriched Latent Model for Offline Reinforcement Learning [54.291331971813364]
Offline reinforcement learning approaches can be divided into proximal and uncertainty-aware methods.
In this work, we demonstrate the benefit of combining the two in a latent variational model.
Our proposed metrics measure both the quality of out-of-distribution samples and the discrepancy of examples in the data.
arXiv Detail & Related papers (2021-02-22T19:42:40Z)
- Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training over-parameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
- Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with natural gradient methods in their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)
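As a quick numerical reading of the covariance relation quoted in the SGLD entry above (a toy, hypothetical check, not that paper's code or its actual optimization setup): if C is an estimated expected gradient covariance, the claimed optimal noise covariance is its matrix square root, which can be formed from an eigendecomposition.

    # Toy check of the "optimal noise covariance = square root of the expected
    # gradient covariance" relation quoted above (illustrative only).
    import numpy as np

    rng = np.random.default_rng(1)
    d = 5
    A = rng.normal(size=(d, d))
    C = A @ A.T + 1e-3 * np.eye(d)        # synthetic symmetric PSD "gradient covariance"

    eigvals, eigvecs = np.linalg.eigh(C)  # eigendecomposition of the symmetric matrix
    Sigma_opt = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T  # matrix square root

    # Sanity check: Sigma_opt @ Sigma_opt should reproduce C.
    print(np.allclose(Sigma_opt @ Sigma_opt, C, atol=1e-8))

The eigendecomposition route is used only because numpy has no direct matrix square root; scipy.linalg.sqrtm would serve equally well.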