Convergence, Sticking and Escape: Stochastic Dynamics Near Critical Points in SGD
- URL: http://arxiv.org/abs/2505.18535v1
- Date: Sat, 24 May 2025 06:00:45 GMT
- Title: Convergence, Sticking and Escape: Stochastic Dynamics Near Critical Points in SGD
- Authors: Dmitry Dudukalov, Artem Logachov, Vladimir Lotov, Timofei Prasolov, Evgeny Prokopenko, Anton Tarasenko
- Abstract summary: We study the convergence properties and escape dynamics of Stochastic Gradient Descent (SGD) in one-dimensional landscapes. Our main focus is to identify the time scales on which SGD reliably moves from an initial point to the local minimum in the same "basin". Overall, our findings present a nuanced view of SGD's transitions between local maxima and minima, influenced by both noise characteristics and the underlying function geometry.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the convergence properties and escape dynamics of Stochastic Gradient Descent (SGD) in one-dimensional landscapes, separately considering infinite- and finite-variance noise. Our main focus is to identify the time scales on which SGD reliably moves from an initial point to the local minimum in the same "basin". Under suitable conditions on the noise distribution, we prove that SGD converges to the basin's minimum unless the initial point lies too close to a local maximum. In that near-maximum scenario, we show that SGD can linger for a long time in its neighborhood. For initial points near a "sharp" maximum, we show that SGD does not remain stuck there, and we provide results to estimate the probability that it will reach each of the two neighboring minima. Overall, our findings present a nuanced view of SGD's transitions between local maxima and minima, influenced by both noise characteristics and the underlying function geometry.
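The escape behavior the abstract describes can be illustrated with a small toy experiment (this is an illustrative sketch, not the paper's actual model or proofs; all function names and parameter values below are assumptions chosen for the demo). We run SGD on the double-well landscape f(x) = (x² − 1)²/4, whose local maximum at x = 0 separates minima at x = ±1, starting slightly to the right of the maximum, and compare finite-variance Gaussian noise with infinite-variance Student-t noise:

```python
import math
import random

def grad(x):
    # f(x) = (x^2 - 1)^2 / 4: local maximum at 0, minima at x = +/- 1
    return x**3 - x

def run_sgd(x0, noise, eta=0.01, steps=5_000):
    """One SGD trajectory: x <- x - eta * (f'(x) + noise)."""
    x = x0
    for _ in range(steps):
        x -= eta * (grad(x) + noise())
        x = max(-5.0, min(5.0, x))  # clip so heavy-tailed jumps stay bounded
    return x

def basin_frequencies(x0, noise, trials=300):
    """Fraction of runs ending near the left / right minimum."""
    left = right = 0
    for _ in range(trials):
        x = run_sgd(x0, noise)
        if x < -0.5:
            left += 1
        elif x > 0.5:
            right += 1
    return left / trials, right / trials

random.seed(0)

def gaussian_noise():
    return random.gauss(0.0, 1.0)  # finite variance

def heavy_tailed_noise(df=1.5):
    # Student-t with df < 2 degrees of freedom: infinite variance
    return random.gauss(0.0, 1.0) / math.sqrt(random.gammavariate(df / 2, 2 / df))

print("Gaussian, start 0.05 right of maximum:",
      basin_frequencies(0.05, gaussian_noise))
print("Heavy-tailed, start 0.05 right of maximum:",
      basin_frequencies(0.05, heavy_tailed_noise))
```

In runs of this kind, the Gaussian trajectories settle in one of the two minima in essentially every trial, with a bias toward the minimum on the same side as the starting point; the heavy-tailed noise produces occasional large jumps across the barrier, consistent with the qualitative picture of noise-dependent escape in the abstract.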
Related papers
- Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials [15.718093624695552]
We analyze the convergence of Stochastic Gradient Langevin Dynamics (SGLD) to global minima based on Lyapunov potentials and optimization.
We provide 1) improved guarantees for SGLD for optimization over previous works, 2) the first finite gradient complexity results for SGLD, and 3) a proof that if continuous-time Langevin Dynamics succeeds for optimization, then discrete-time SGLD succeeds under mild regularity assumptions.
arXiv Detail & Related papers (2024-07-05T05:34:10Z)
- High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise [59.25598762373543]
We establish high-probability convergence guarantees for learning on streaming data in the presence of heavy-tailed noise.
We demonstrate, analytically and empirically, that $ta$ can be used to identify the preferred setting for a given problem.
arXiv Detail & Related papers (2023-10-28T18:53:41Z)
- Convergence of stochastic gradient descent under a local Lojasiewicz condition for deep neural networks [7.9626223030099545]
We establish the local convergence of stochastic gradient descent with positive probability.
We provide examples of neural networks with finite widths such that our assumptions hold.
arXiv Detail & Related papers (2023-04-18T18:20:52Z)
- From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
An overarching goal of this paper is to provide general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z)
- When does SGD favor flat minima? A quantitative characterization via linear stability [7.252584656056866]
Stochastic gradient descent (SGD) favors flat minima.
This property of SGD noise provably holds for linear networks and random feature models (RFMs).
arXiv Detail & Related papers (2022-07-06T12:40:09Z)
- Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent [0.0]
The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties.
We show that noise classes with the same mean and covariance structure as minibatch SGD noise induce similar properties.
We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
arXiv Detail & Related papers (2021-12-14T02:25:43Z)
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
- SGD with a Constant Large Learning Rate Can Converge to Local Maxima [4.014524824655106]
We construct worst-case optimization problems illustrating that stochastic gradient descent can exhibit strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that SGD converges to local maxima.
Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes.
arXiv Detail & Related papers (2021-07-25T10:12:18Z)
- Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods to train deep neural networks.
We show that the covariance of the SGD noise in the local region of a local minimum is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
- On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids saddle points/manifolds with probability $1$ for a broad class of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
- Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
- Is Local SGD Better than Minibatch SGD? [60.42437186984968]
We show how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD.
We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
arXiv Detail & Related papers (2020-02-18T19:22:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.