Emergence of heavy tails in homogenized stochastic gradient descent
- URL: http://arxiv.org/abs/2402.01382v1
- Date: Fri, 2 Feb 2024 13:06:33 GMT
- Title: Emergence of heavy tails in homogenized stochastic gradient descent
- Authors: Zhe Jiao, Martin Keller-Ressel
- Abstract summary: Loss minimization by stochastic gradient descent (SGD) leads to heavy-tailed network parameters.
We analyze a continuous diffusion approximation of SGD, called homogenized stochastic gradient descent.
We quantify the interplay between optimization parameters and the tail-index.
- Score: 1.450405446885067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has repeatedly been observed that loss minimization by stochastic gradient
descent (SGD) leads to heavy-tailed distributions of neural network parameters.
Here, we analyze a continuous diffusion approximation of SGD, called
homogenized stochastic gradient descent, show that it behaves asymptotically
heavy-tailed, and give explicit upper and lower bounds on its tail-index. We
validate these bounds in numerical experiments and show that they are typically
close approximations to the empirical tail-index of SGD iterates. In addition,
their explicit form enables us to quantify the interplay between optimization
parameters and the tail-index. Doing so, we contribute to the ongoing
discussion on links between heavy tails and the generalization performance of
neural networks as well as the ability of SGD to avoid suboptimal local minima.
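The mechanism behind heavy-tailed SGD iterates can be illustrated with a toy sketch (all parameter values below are illustrative assumptions, not taken from the paper): on a quadratic loss with noisy per-step curvature, SGD follows a Kesten-type recursion x_{k+1} = (1 - eta * a_k) x_k + eta * b_k, whose stationary distribution is heavy-tailed, and a Hill estimator gives an empirical tail-index.

```python
import numpy as np

rng = np.random.default_rng(0)

# SGD on a 1-D quadratic loss f(x) = 0.5 * a * x^2, with mini-batch noise
# modeled as a random curvature a_k and additive gradient noise b_k.
# Multiplicative noise of this kind is the classic route to heavy tails.
eta = 0.5          # step size (hypothetical value, chosen for illustration)
n_steps = 20000
x = 1.0
iterates = np.empty(n_steps)
for k in range(n_steps):
    a_k = rng.normal(1.0, 1.0)   # noisy curvature estimate
    b_k = rng.normal(0.0, 1.0)   # additive gradient noise
    x = (1.0 - eta * a_k) * x + eta * b_k
    iterates[k] = x

def hill_estimator(samples, k):
    """Hill estimator of the tail-index from the k largest |samples|."""
    order = np.sort(np.abs(samples))[::-1]
    logs = np.log(order[:k]) - np.log(order[k])
    return 1.0 / np.mean(logs)

# Discard the first half as burn-in before estimating the tail-index.
alpha_hat = hill_estimator(iterates[n_steps // 2:], 500)
print(f"estimated tail-index: {alpha_hat:.2f}")
```

This is only a sketch of the general mechanism; the paper's bounds concern homogenized SGD, not this one-dimensional recursion.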
Related papers
- Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting [0.6445605125467574]
We prove bounds on the rate of convergence for the momentum stochastic gradient descent scheme (MSGD).
We analyze the optimal choice of the friction parameter and show that the MSGD process almost surely converges to a local minimum.
arXiv Detail & Related papers (2023-02-07T15:59:08Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z) - Federated Stochastic Gradient Langevin Dynamics [12.180900849847252]
Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling.
We propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates.
We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where DSGLD fails.
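For context, the (non-federated) SGLD update underlying the paper above is a single noisy gradient step plus injected Gaussian noise: theta_{t+1} = theta_t - (eps/2) * grad U(theta_t) + N(0, eps). A minimal sketch sampling a 1-D standard normal target, with mini-batch noise modeled as an assumed small perturbation of the gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

# SGLD on a 1-D standard normal target: U(theta) = theta^2 / 2,
# so grad U(theta) = theta. Mini-batch error is modeled by adding
# small Gaussian noise to the gradient (an illustrative assumption).
eps = 0.01          # step size (assumed constant for simplicity)
n_steps = 50000
theta = 0.0
samples = np.empty(n_steps)
for t in range(n_steps):
    grad = theta + rng.normal(0.0, 0.1)  # noisy gradient estimate
    theta = theta - 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))
    samples[t] = theta

# After burn-in, the samples approximate the N(0, 1) target.
print(np.mean(samples[10000:]), np.std(samples[10000:]))
```

The federated variant (DSGLD) and the conducive-gradient correction proposed in the paper build on this basic update; they are not shown here.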
arXiv Detail & Related papers (2020-04-23T15:25:09Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.