A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2310.00692v3
- Date: Thu, 1 Feb 2024 11:15:37 GMT
- Title: A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent
- Authors: Mingze Wang, Lei Wu
- Abstract summary: We provide a theoretical study of noise geometry for minibatch stochastic gradient descent (SGD), a phenomenon where noise aligns favorably with the geometry of the local landscape.
We propose two metrics, derived from analyzing how noise influences the loss and subspace projection dynamics, to quantify the alignment strength.
- Score: 9.064667124987068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we provide a theoretical study of noise geometry for minibatch
stochastic gradient descent (SGD), a phenomenon where noise aligns favorably
with the geometry of the local landscape. We propose two metrics, derived from
analyzing how noise influences the loss and subspace projection dynamics, to
quantify the alignment strength. We show that for (over-parameterized) linear
models and two-layer nonlinear networks, when measured by these metrics, the
alignment can be provably guaranteed under conditions independent of the degree
of over-parameterization. To showcase the utility of our noise geometry
characterizations, we present a refined analysis of the mechanism by which SGD
escapes from sharp minima. We reveal that unlike gradient descent (GD), which
escapes along the sharpest directions, SGD tends to escape from flatter
directions and cyclical learning rates can exploit this SGD characteristic to
navigate more effectively towards flatter regions. Lastly, extensive
experiments are provided to support our theoretical findings.
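To make the alignment idea concrete, here is a minimal, hypothetical Python sketch (not the authors' code; the trace-ratio metric below is an illustrative stand-in for the paper's loss- and projection-based metrics). It estimates how much minibatch-SGD noise energy falls along the sharp directions of an over-parameterized linear regression landscape.

    # Illustrative sketch (not the paper's code): estimate how strongly minibatch
    # SGD noise aligns with the local curvature for over-parameterized linear
    # regression, loss L(w) = (1/2n) * ||X w - y||^2.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, batch = 64, 256, 8          # over-parameterized: d > n
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star
    w = rng.normal(size=d)            # current iterate

    residual = X @ w - y
    per_example_grads = residual[:, None] * X          # g_i = (x_i^T w - y_i) x_i
    full_grad = per_example_grads.mean(axis=0)
    # Minibatch noise covariance (sampling with replacement):
    # Sigma = (1/batch) * ( E[g_i g_i^T] - g g^T )
    second_moment = per_example_grads.T @ per_example_grads / n
    Sigma = (second_moment - np.outer(full_grad, full_grad)) / batch
    H = X.T @ X / n                                    # Hessian of the quadratic loss

    # A simple alignment ratio: noise energy seen through the curvature,
    # normalized by what isotropic noise of the same total variance would give.
    alignment = np.trace(Sigma @ H) / (np.trace(Sigma) * np.trace(H) / d)
    print(f"alignment ratio (values well above 1 mean the noise concentrates on sharp directions): {alignment:.2f}")

For such a toy problem the ratio typically comes out well above 1, consistent with the claim that SGD noise aligns with the local curvature rather than spreading isotropically; the paper's actual metrics are derived from the loss and subspace-projection dynamics rather than this trace ratio.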
Related papers
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of gradient descent.
We find such implicit regularizer would favor some convergence points that could stabilize model outputs against perturbation of parameters.
Our work does not assume SGD to be an Ornstein-Uhlenbeck-like process and achieves a more general result, with convergence of the approximation proven.
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
- Per-Example Gradient Regularization Improves Learning Signals from Noisy Data [25.646054298195434]
Empirical evidence suggests that per-example gradient regularization (PEGR) can significantly enhance the robustness of deep learning models against noisy perturbations.
We present a theoretical analysis that demonstrates its effectiveness in improving both test error and robustness against noise perturbations.
Our analysis reveals that PEGR penalizes the variance of pattern learning, thus effectively suppressing the memorization of noise from the training data.
arXiv Detail & Related papers (2023-03-31T10:08:23Z)
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
- Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima [10.990447273771592]
We develop a quantitative theory of the escape problem for the stochastic gradient descent (SGD) algorithm.
We investigate the effect of the sharpness of loss surfaces on SGD's escape from local minima in neural networks.
arXiv Detail & Related papers (2021-11-07T05:00:35Z)
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance (a numerical sketch of this relation appears after this list).
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
- GELATO: Geometrically Enriched Latent Model for Offline Reinforcement Learning [54.291331971813364]
Offline reinforcement learning approaches can be divided into proximal and uncertainty-aware methods.
In this work, we demonstrate the benefit of combining the two in a latent variational model.
Our proposed metrics measure both the quality of out-of-distribution samples and the discrepancy of examples in the data.
arXiv Detail & Related papers (2021-02-22T19:42:40Z)
- Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training over-parameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
- Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with natural gradient methods in their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)
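As a quick numerical reading of the covariance relation quoted in the SGLD entry above (a toy, hypothetical check, not that paper's code or its actual optimization setup): if C is an estimated expected gradient covariance, the claimed optimal noise covariance is its matrix square root, which can be formed from an eigendecomposition.

    # Toy check of the "optimal noise covariance = square root of the expected
    # gradient covariance" relation quoted above (illustrative only).
    import numpy as np

    rng = np.random.default_rng(1)
    d = 5
    A = rng.normal(size=(d, d))
    C = A @ A.T + 1e-3 * np.eye(d)        # synthetic symmetric PSD "gradient covariance"

    eigvals, eigvecs = np.linalg.eigh(C)  # eigendecomposition of the symmetric matrix
    Sigma_opt = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T  # matrix square root

    # Sanity check: Sigma_opt @ Sigma_opt should reproduce C.
    print(np.allclose(Sigma_opt @ Sigma_opt, C, atol=1e-8))

The eigendecomposition route is used only because numpy has no direct matrix square root; scipy.linalg.sqrtm would serve equally well.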