Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum
under Heavy-Tailed Gradient Noise
- URL: http://arxiv.org/abs/2002.05685v2
- Date: Wed, 4 Nov 2020 16:17:37 GMT
- Title: Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum
under Heavy-Tailed Gradient Noise
- Authors: Umut Şimşekli, Lingjiong Zhu, Yee Whye Teh, Mert Gürbüzbalaban
- Abstract summary: We show that the Euler discretization of FULD has noteworthy algorithmic similarities with natural gradient methods and gradient clipping, bringing a new perspective on their role in deep learning.
- Score: 39.9241638707715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent with momentum (SGDm) is one of the most popular
optimization algorithms in deep learning. While there is a rich theory of SGDm
for convex problems, the theory is considerably less developed in the context
of deep learning where the problem is non-convex and the gradient noise might
exhibit a heavy-tailed behavior, as empirically observed in recent studies. In
this study, we consider a \emph{continuous-time} variant of SGDm, known as the
underdamped Langevin dynamics (ULD), and investigate its asymptotic properties
under heavy-tailed perturbations. Supported by recent studies from statistical
physics, we argue both theoretically and empirically that the heavy-tails of
such perturbations can result in a bias even when the step-size is small, in
the sense that \emph{the optima of the stationary distribution} of the dynamics
might not match \emph{the optima of the cost function to be optimized}. As a
remedy, we develop a novel framework, which we coin as \emph{fractional} ULD
(FULD), and prove that FULD targets the so-called Gibbs distribution, whose
optima exactly match the optima of the original cost. We observe that the Euler
discretization of FULD has noteworthy algorithmic similarities with
\emph{natural gradient} methods and \emph{gradient clipping}, bringing a new
perspective on understanding their role in deep learning. We support our theory
with experiments conducted on a synthetic model and neural networks.
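For intuition about the setting above, the sketch below shows a single Euler-Maruyama step of underdamped (momentum) Langevin dynamics driven by symmetric alpha-stable, heavy-tailed noise. It is a minimal illustration of the dynamics the paper perturbs, not the authors' FULD correction; the helper names (sas_noise, uld_alpha_stable_step), the toy quadratic cost, and all step sizes are assumptions made for this sketch.
```python
# Illustrative Euler-Maruyama step for underdamped (momentum) Langevin dynamics
# driven by symmetric alpha-stable (heavy-tailed) noise. This sketches the setting
# studied in the paper, NOT the authors' exact FULD correction; sas_noise,
# uld_alpha_stable_step, grad_f, and all step sizes are illustrative assumptions.
import numpy as np

def sas_noise(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def uld_alpha_stable_step(x, v, grad_f, rng, eta=1e-3, gamma=1.0, alpha=1.8):
    """One Euler step: dv = -(gamma*v + grad f(x)) dt + dL^alpha_t,  dx = v dt."""
    v = v - eta * (gamma * v + grad_f(x)) + eta ** (1.0 / alpha) * sas_noise(alpha, x.shape, rng)
    x = x + eta * v
    return x, v

# Toy usage on a quadratic cost; with heavy tails (alpha < 2) the iterates can be
# biased away from the minimizer, which is the issue FULD is designed to remove.
rng = np.random.default_rng(0)
x, v = np.ones(5), np.zeros(5)
for _ in range(2000):
    x, v = uld_alpha_stable_step(x, v, grad_f=lambda z: z, rng=rng)
```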
Related papers
- Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks [0.6906005491572401]
We show that noise in gradient descent (SGD) with momentum smoothes the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, and the upper bound of the norm.
We also provide experimental results supporting our assertion that model generalizability depends on the noise level.
arXiv Detail & Related papers (2024-02-04T02:48:28Z)
- Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise [16.12834917344859]
It is widely conjectured that the heavy-ball momentum method can provide accelerated convergence and should work well in large-batch settings.
We show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving a near-optimal convergence rate.
This means SGD with heavy-ball momentum is useful in large-batch settings such as distributed machine learning and federated learning; a schematic heavy-ball update is sketched after this list.
arXiv Detail & Related papers (2023-12-22T09:58:39Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities for analyzing closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time discretization, and gradient approximation; a finite-particle discretization is sketched after this list.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance; a schematic version of this noise structure is sketched after this list.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
- Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods for training deep neural networks.
We show that the covariance of the SGD noise in the neighborhood of local minima is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
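Several of the related papers above concern SGD with heavy-ball momentum; as a common reference point, here is a minimal sketch of the classical heavy-ball update applied to noisy, mini-batch-style gradients. The function heavy_ball_step and the toy quadratic objective are illustrative choices, not taken from any of the listed papers.
```python
# Minimal heavy-ball (SGD with momentum) update, shown only as a reference point
# for the momentum-related entries above; names and the toy objective are illustrative.
import numpy as np

def heavy_ball_step(x, buf, stoch_grad, lr=0.01, momentum=0.9):
    """x_{k+1} = x_k - lr * g_k + momentum * (x_k - x_{k-1}), via a velocity buffer."""
    buf = momentum * buf - lr * stoch_grad(x)
    return x + buf, buf

# Toy usage: noisy gradients of a quadratic, mimicking mini-batch gradient noise.
rng = np.random.default_rng(1)
x, buf = np.ones(10), np.zeros(10)
for _ in range(500):
    x, buf = heavy_ball_step(x, buf, lambda z: z + 0.1 * rng.normal(size=z.shape))
```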
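For the mean-field Langevin dynamics entry, the following finite-particle, Euler-discretized sketch shows what a distribution-dependent drift looks like after the finite-particle and time discretizations mentioned in that summary. The confining potential, the Gaussian interaction kernel, and the function mfld_step are illustrative assumptions rather than the paper's actual setup.
```python
# Finite-particle, Euler-discretized sketch of mean-field Langevin dynamics (MFLD)
# with a distribution-dependent drift. The confining potential V(x) = |x|^2 / 2,
# the Gaussian interaction kernel, and the step sizes are illustrative assumptions.
import numpy as np

def mfld_step(particles, rng, lr=1e-2, temperature=1e-2):
    """One Euler step for N particles approximating the mean-field law."""
    n, d = particles.shape
    drift_v = particles                                     # gradient of V(x) = |x|^2 / 2
    diffs = particles[:, None, :] - particles[None, :, :]   # pairwise x_i - x_j
    # gradient of the interaction W(z) = -exp(-|z|^2 / 2), averaged over the particles
    weights = np.exp(-0.5 * np.sum(diffs ** 2, axis=-1, keepdims=True))
    drift_w = np.mean(diffs * weights, axis=1)
    noise = rng.normal(size=(n, d))
    return particles - lr * (drift_v + drift_w) + np.sqrt(2 * lr * temperature) * noise

# Toy usage: 256 particles in 2D, whose empirical measure approximates the MFLD law.
rng = np.random.default_rng(3)
particles = rng.normal(size=(256, 2))
for _ in range(1000):
    particles = mfld_step(particles, rng)
```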
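The anisotropic-noise SGLD entry states that, under a low-empirical-risk constraint, the optimal injected-noise covariance is the square root of the expected gradient covariance. The sketch below follows that recipe with an empirical gradient-covariance estimate; sqrtm_psd, anisotropic_sgld_step, the temperature, and the scalings are illustrative assumptions, not the authors' exact procedure.
```python
# Illustrative SGLD-style step with anisotropic injected noise whose covariance is
# the (matrix) square root of an empirical gradient covariance. Names and scalings
# are illustrative assumptions.
import numpy as np

def sqrtm_psd(mat):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def anisotropic_sgld_step(x, grads, rng, lr=1e-2, temperature=1e-3):
    """grads: (n_samples, dim) per-sample gradients at x, used to estimate covariance."""
    mean_grad = grads.mean(axis=0)
    cov = np.cov(grads, rowvar=False)            # empirical gradient covariance
    noise_cov = sqrtm_psd(cov)                   # square root of that covariance
    noise = rng.multivariate_normal(np.zeros(x.size), noise_cov)
    return x - lr * mean_grad + np.sqrt(2 * lr * temperature) * noise

# Toy usage with random per-sample gradients standing in for a mini-batch.
rng = np.random.default_rng(2)
x = np.zeros(4)
grads = rng.normal(size=(32, 4)) + x             # hypothetical per-sample gradients at x
x = anisotropic_sgld_step(x, grads, rng)
```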