On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems
- URL: http://arxiv.org/abs/2006.11144v1
- Date: Fri, 19 Jun 2020 14:11:26 GMT
- Title: On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems
- Authors: Panayotis Mertikopoulos and Nadav Hallak and Ali Kavis and Volkan
Cevher
- Abstract summary: This paper analyzes the trajectories of gradient descent (SGD)
We show that SGD avoids saddle points/manifolds with $1$ for strict step-size policies.
- Score: 75.58134963501094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper analyzes the trajectories of stochastic gradient descent (SGD) to
help understand the algorithm's convergence properties in non-convex problems.
We first show that the sequence of iterates generated by SGD remains bounded
and converges with probability $1$ under a very broad range of step-size
schedules. Subsequently, going beyond existing positive probability guarantees,
we show that SGD avoids strict saddle points/manifolds with probability $1$ for
the entire spectrum of step-size policies considered. Finally, we prove that
the algorithm's rate of convergence to Hurwicz minimizers is
$\mathcal{O}(1/n^{p})$ if the method is employed with a $\Theta(1/n^p)$
step-size schedule. This provides an important guideline for tuning the
algorithm's step-size as it suggests that a cool-down phase with a vanishing
step-size could lead to faster convergence; we demonstrate this heuristic using
ResNet architectures on CIFAR.
Related papers
- Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation [22.652143194356864]
We address the problem of solving strongly convex and smooth problems using a descent gradient (SGD) algorithm with a constant step size.
We provide an expansion of the mean-squared error of the resulting estimator with respect to the number iterations of $n$.
We establish that this chain is geometrically ergodic with respect to a defined weighted Wasserstein semimetric.
arXiv Detail & Related papers (2024-10-07T15:02:48Z) - On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffle SGD are popular due to minimizing the number of misses and good cache.
This paper delves into the convergence properties SGD algorithms with arbitrary data ordering.
arXiv Detail & Related papers (2023-05-30T17:47:27Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make gradient descent (SGD) adaptive to (i) the noise.
We prove that $T$ iterations of SGD with Nesterov iterations can be near optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits perfectly inputs and outputs $langle theta_*, phi(X) rangle = Y$, where $phi(X)$ stands for a possibly infinite dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel, referred to as Recursive One-Over-T SGD, based on an easily implementable, averaging of past gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Almost sure convergence rates for Stochastic Gradient Descent and
Stochastic Heavy Ball [17.33867778750777]
We study gradient descent (SGD) and the heavy ball method (SHB) for the general approximation problem.
For SGD, in the convex and smooth setting, we provide the first emphalmost sure convergence emphrates for a weighted average of the iterates.
arXiv Detail & Related papers (2020-06-14T11:12:05Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We show a proof of convergence between the Adam Adagrad and $O(d(N)/st)$ algorithms.
Adam converges with the same convergence $O(d(N)/st)$ when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Non-asymptotic Convergence of Adam-type Reinforcement Learning
Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $mathcalO(log T/sqrtT)$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $mathcalO(log T/sqrtT
arXiv Detail & Related papers (2020-02-15T00:26:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.