On Uniform Boundedness Properties of SGD and its Momentum Variants
- URL: http://arxiv.org/abs/2201.10245v1
- Date: Tue, 25 Jan 2022 11:34:56 GMT
- Title: On Uniform Boundedness Properties of SGD and its Momentum Variants
- Authors: Xiaoyu Wang and Mikael Johansson
- Abstract summary: We investigate uniform boundedness properties of iterates and function values along the trajectories of the stochastic gradient descent algorithm and its momentum variants.
We show that broad families of step-sizes, including the widely used step-decay and cosine with (or without) restart step-sizes, result in uniformly bounded iterates and function values.
- Score: 38.41217525394239
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A theoretical, and potentially also practical, problem with stochastic
gradient descent is that trajectories may escape to infinity. In this note, we
investigate uniform boundedness properties of iterates and function values
along the trajectories of the stochastic gradient descent algorithm and its
important momentum variant. Under smoothness and $R$-dissipativity of the loss
function, we show that broad families of step-sizes, including the widely used
step-decay and cosine with (or without) restart step-sizes, result in uniformly
bounded iterates and function values. Several important applications that
satisfy these assumptions, including phase retrieval problems, Gaussian mixture
models and some neural network classifiers, are discussed in detail.
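To make the step-size families concrete, the sketch below runs SGD with step-decay and cosine-with-restart schedules on a toy smooth, dissipative loss and checks that the iterates stay bounded. Everything here (the loss f(x) = 0.25 * ||x||^4, the noise model, and all parameter values) is an illustrative assumption, not a construction from the paper.

```python
import numpy as np

def step_decay(eta0, decay, period):
    """Step-decay schedule: eta0 * decay**floor(t / period)."""
    return lambda t: eta0 * decay ** (t // period)

def cosine_restart(eta0, period):
    """Cosine schedule restarted every `period` iterations."""
    return lambda t: 0.5 * eta0 * (1 + np.cos(np.pi * (t % period) / period))

def sgd(grad, x0, schedule, n_iters, rng):
    """Plain SGD with a time-varying step-size; returns the trajectory."""
    x, traj = np.array(x0, dtype=float), []
    for t in range(n_iters):
        x = x - schedule(t) * grad(x, rng)
        traj.append(x.copy())
    return np.array(traj)

# Toy dissipative loss f(x) = 0.25 * ||x||^4, so grad f(x) = ||x||^2 * x;
# <grad f(x), x> = ||x||^4 grows with ||x||, pushing large iterates back inward.
def noisy_grad(x, rng):
    return (x @ x) * x + rng.normal(scale=0.1, size=x.shape)

rng = np.random.default_rng(0)
traj_cos = sgd(noisy_grad, [1.0, -1.0], cosine_restart(0.05, 100), 1000, rng)
traj_dec = sgd(noisy_grad, [1.0, -1.0], step_decay(0.05, 0.5, 250), 1000, rng)
for traj in (traj_cos, traj_dec):
    print(np.max(np.linalg.norm(traj, axis=1)))  # stays bounded for both schedules
```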
Related papers
- Non-asymptotic convergence analysis of the stochastic gradient
Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with
applications to training of ReLU neural networks [8.058385158111207]
We provide a non-asymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo algorithm to a target measure in Wasserstein-1 and Wasserstein-2 distance.
To illustrate our main results, we consider numerical experiments on quantile estimation and on several problems involving ReLU neural networks relevant in finance and artificial intelligence.
arXiv Detail & Related papers (2024-09-25T17:21:09Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent [0.3831327965422187]
This paper proposes a novel approach to adaptive step sizes in stochastic gradient descent (SGD).
We use quantities that we have identified as numerically traceable -- the Lipschitz constant for gradients and a concept of the local variance in search directions.
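The quantities named above (a gradient Lipschitz estimate and the local variance of search directions) hint at how such a rule can be built. The sketch below is a hypothetical stand-in, not the paper's preconditioned method: it estimates a local Lipschitz constant from successive gradient differences, in the spirit of a Barzilai-Borwein step, and uses its inverse as the step-size.

```python
import numpy as np

def adaptive_sgd(grad, x0, n_iters, eta0=0.1, eps=1e-12):
    """Gradient descent with the step-size set to an inverse local Lipschitz
    estimate L ~ ||g_t - g_{t-1}|| / ||x_t - x_{t-1}|| (a hypothetical rule)."""
    x = np.array(x0, dtype=float)
    x_prev, g_prev, eta = None, None, eta0
    for _ in range(n_iters):
        g = grad(x)
        if g_prev is not None:
            dx = np.linalg.norm(x - x_prev)
            dg = np.linalg.norm(g - g_prev)
            if dx > eps and dg > eps:
                eta = dx / dg  # 1 / (local Lipschitz estimate)
        x_prev, g_prev = x.copy(), g.copy()
        x = x - eta * g
    return x

# Quadratic test problem f(x) = 0.5 * x^T A x with minimizer at the origin;
# exact (deterministic) gradients are used to keep the demo simple.
A = np.diag([1.0, 10.0])
x_final = adaptive_sgd(lambda x: A @ x, [1.0, 1.0], 200)
print(np.linalg.norm(x_final))  # close to 0
```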
arXiv Detail & Related papers (2023-11-28T17:03:56Z)
- Curvature-Independent Last-Iterate Convergence for Games on Riemannian Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate.
To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
arXiv Detail & Related papers (2023-06-29T01:20:44Z)
- Stochastic Langevin Differential Inclusions with Applications to Machine Learning [5.274477003588407]
We show some foundational results regarding the flow and properties of Langevin-type Differential Inclusions.
In particular, we show strong existence of the solution, as well as the asymptotic minimization of the free-energy functional.
arXiv Detail & Related papers (2022-06-23T08:29:17Z)
- Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms for solving nonconvex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
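The variance-reduced variant follows the familiar SVRG template applied to Langevin sampling. Below is a minimal sketch under assumptions of my own (a finite-sum quadratic potential and illustrative parameter values); it is not the paper's algorithm verbatim.

```python
import numpy as np

def svrg_ld(grads, x0, eta, n_epochs, m, rng):
    """SVRG-style variance-reduced Langevin dynamics (illustrative sketch).

    grads: per-sample gradient functions of a finite-sum potential F = mean(f_i).
    Inner steps use v = grad_i(x) - grad_i(x_ref) + grad F(x_ref), then take a
    Langevin step x <- x - eta * v + sqrt(2 * eta) * gaussian_noise.
    """
    n = len(grads)
    x = np.array(x0, dtype=float)
    for _ in range(n_epochs):
        x_ref = x.copy()
        full = sum(g(x_ref) for g in grads) / n   # anchor: full gradient at x_ref
        for _ in range(m):
            i = rng.integers(n)
            v = grads[i](x) - grads[i](x_ref) + full
            x = x - eta * v + np.sqrt(2 * eta) * rng.normal(size=x.shape)
    return x

# Toy potential: f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
# The per-sample gradients differ only by constants, so variance reduction
# is exact here and v equals the full gradient.
rng = np.random.default_rng(1)
centers = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
gs = [lambda x, c=c: x - c for c in centers]
sample = svrg_ld(gs, [0.0, 0.0], eta=0.05, n_epochs=50, m=10, rng=rng)
print(sample)  # an approximate draw from the Gibbs measure of the mean potential
```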
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
- Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification [25.898873960635534]
We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture.
We define a prototype process that can be extended to a continuous-time gradient flow.
In the full-batch limit, we recover the standard gradient flow.
arXiv Detail & Related papers (2020-06-10T22:49:41Z)
- The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
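A one-dimensional caricature of this mechanism (my own illustration, not the paper's analysis): SGD on a quadratic with random curvature reduces to a Kesten-type recursion x <- a_t * x + b_t, and when the random multiplier a_t = 1 - eta * h_t satisfies E[a^2] < 1 < E[a^4], the stationary law has a power-law tail with index between 2 and 4, which shows up as a large sample kurtosis.

```python
import numpy as np

# Kesten-type caricature of SGD on a random quadratic: x <- (1 - eta*h_t)*x + b_t
# with random curvatures h_t ~ Exp(mean 2). For eta = 0.4, E[(1-eta*h)^2] = 0.68 < 1
# (a stationary variance exists) while E[(1-eta*h)^4] ~= 3.02 > 1 (the fourth
# moment diverges), so the stationary law is heavy-tailed.
rng = np.random.default_rng(2)
eta, n_steps, n_chains = 0.4, 2000, 5000
x = np.zeros(n_chains)
for _ in range(n_steps):
    h = rng.exponential(scale=2.0, size=n_chains)   # random curvature, mean 2
    b = rng.normal(scale=0.1, size=n_chains)        # additive gradient noise
    x = (1.0 - eta * h) * x + b
z = (x - x.mean()) / x.std()
print(np.mean(z ** 4))  # sample kurtosis well above the Gaussian value of 3
```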
arXiv Detail & Related papers (2020-06-08T16:43:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.