Related papers: Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

URL: http://arxiv.org/abs/2510.21245v1
Date: Fri, 24 Oct 2025 08:28:53 GMT
Title: Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime
Authors: Noah Oberweis, Semih Cayci
Abstract summary: Continuous-time models provide insights into the training dynamics of optimization algorithms in deep learning. We establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD). We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise yields a non-degenerate kernel throughout the training process with high probability.
Score: 4.297070083645049
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.

Related papers

Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise [20.922456964393213]
We establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed noise. For quadratic loss functions, we show that SGDm admits a worse generalization bound in the presence of momentum and heavy tails. We develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise.
arXiv Detail & Related papers (2025-02-02T19:25:48Z)
Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent [0.6906005491572401]
In deep neural networks, stochastic gradient descent (SGD) with momentum is said to converge faster and have better generalizability than SGD without momentum. In particular, adding momentum is thought to reduce the batch noise. We analyzed the effect of search direction noise, which is defined as the error between the search direction and the steepest descent direction.
arXiv Detail & Related papers (2024-02-04T02:48:28Z)
Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift. Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures. We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS). This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting [0.6445605125467574]
We prove bounds on the rate of convergence for the momentum stochastic gradient descent scheme (MSGD). We analyze the optimal choice of the friction and show that the MSGD process almost surely converges to a local.
arXiv Detail & Related papers (2023-02-07T15:59:08Z)
Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability. We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
Losing momentum in continuous-time stochastic optimisation [42.617042045455506]
Momentum-based optimisation algorithms have become particularly widespread. In this work, we analyse a continuous-time model for gradient descent with momentum. We also train a convolutional neural network in an image classification problem.
arXiv Detail & Related papers (2022-09-08T10:46:05Z)
Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented. $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions [17.14287157979558]
We propose an adaptively weighted stochastic gradient Langevin dynamics (SGLD) for learning in big data statistics. The proposed algorithm is tested on benchmark datasets including CIFAR100.
arXiv Detail & Related papers (2020-10-19T19:20:47Z)
On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate. We find that the learning rate tends to zero for a broad non-neural class functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
Convergence rates and approximation results for SGD and its continuous-time counterpart [16.70533901524849]
This paper proposes a thorough theoretical analysis of convex Stochastic Gradient Descent (SGD) with non-increasing step sizes. First, we show that SGD can be provably approximated by solutions of an inhomogeneous Stochastic Differential Equation (SDE) using coupling. Following recent analyses of deterministic and stochastic optimization methods by their continuous counterpart, we study the long-time behavior of the continuous processes at hand and establish non-asymptotic bounds.
arXiv Detail & Related papers (2020-04-08T18:31:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.