Escaping Saddle Points with Stochastically Controlled Stochastic
Gradient Methods
- URL: http://arxiv.org/abs/2103.04413v1
- Date: Sun, 7 Mar 2021 18:09:43 GMT
- Title: Escaping Saddle Points with Stochastically Controlled Stochastic
Gradient Methods
- Authors: Guannan Liang, Qianqian Tong, Chunjiang Zhu, Jinbo Bi
- Abstract summary: Stochastically controlled stochastic gradient (SCSG) methods converge efficiently to first-order stationary points, which can be saddle points in nonconvex optimization.
A separate SGD step is proposed to help SCSG escape strict saddle points, resulting in the CNC-SCSG method.
A more general framework further incorporates the proposed CNC-SCSG into any first-order method so that it can escape saddle points.
- Score: 12.173568611144626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastically controlled stochastic gradient (SCSG) methods have been proved
to converge efficiently to first-order stationary points which, however, can be
saddle points in nonconvex optimization. It has been observed that a stochastic
gradient descent (SGD) step introduces anisotropic noise around saddle points
for deep learning and non-convex half space learning problems, which indicates
that SGD satisfies the correlated negative curvature (CNC) condition for these
problems. Therefore, we propose to use a separate SGD step to help the SCSG
method escape from strict saddle points, resulting in the CNC-SCSG method. The
SGD step plays a role similar to noise injection but is more stable. We prove
that the resultant algorithm converges to a second-order stationary point with
a convergence rate of $\tilde{O}( \epsilon^{-2} \log( 1/\epsilon))$ where
$\epsilon$ is the pre-specified error tolerance. This convergence rate is
independent of the problem dimension, and is faster than that of CNC-SGD. A
more general framework is further designed to incorporate the proposed CNC-SCSG
into any first-order method for the method to escape saddle points. Simulation
studies illustrate that the proposed algorithm can escape saddle points in much
fewer epochs than gradient descent methods perturbed by either noise
injection or an SGD step.
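The escape mechanism described above can be illustrated with a toy sketch. This is not the authors' CNC-SCSG algorithm; it only shows the general idea that exact gradient descent stalls at a strict saddle while a stochastically perturbed step (standing in for the SGD-step perturbation) drifts away from it. The objective, step size, and noise scale below are arbitrary illustrative choices.

```python
import numpy as np

# Toy nonconvex objective with a strict saddle at the origin:
#   f(x) = x0^2 - x1^2 + 0.5 * x1^4
# The Hessian at 0 is diag(2, -2), so the origin is a strict saddle;
# the two minima sit at (0, +1) and (0, -1).
def grad(x):
    return np.array([2.0 * x[0], -2.0 * x[1] + 2.0 * x[1] ** 3])

def gd(x0, steps=200, lr=0.05, noise=0.0, seed=0):
    """Gradient descent, optionally perturbed by isotropic noise
    (a stand-in for the SGD-step perturbation in CNC-SCSG)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if noise > 0.0:
            g = g + noise * rng.standard_normal(2)
        x = x - lr * g
    return x

stuck = gd([0.0, 0.0])               # exact gradient is zero: stays at the saddle
escaped = gd([0.0, 0.0], noise=0.1)  # perturbed steps drift toward a minimum
```

Started exactly at the saddle, the unperturbed run never moves (the gradient vanishes there), while the perturbed run is pushed along the negative-curvature direction and settles near one of the minima at $x_1 = \pm 1$.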
Related papers
- Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
We show that AdaGrad outperforms SGD by a factor of $d$ under certain conditions.
Motivated by this, we introduce assumptions on the smoothness structure of the objective and the gradient variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z) - Switch and Conquer: Efficient Algorithms By Switching Stochastic
Gradient Oracles For Decentralized Saddle Point Problems [1.2277343096128712]
We develop an inexact primal hybrid gradient (inexact PDHG) procedure that allows generic gradient oracles to update the primal and dual variables.
We show that utilizing the best convergence phases of GSG and SVRG oracles makes C-DPSSG well suited for obtaining solutions of low/medium accuracy faster.
arXiv Detail & Related papers (2023-09-02T17:48:42Z) - Utilising the CLT Structure in Stochastic Gradient based Sampling :
Improved Analysis and Faster Algorithms [14.174806471635403]
We consider approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamics (IPD).
We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian.
We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to the noise level.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth
Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, along with variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a broad class of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.