Restarted Nonconvex Accelerated Gradient Descent: No More
Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity
- URL: http://arxiv.org/abs/2201.11411v4
- Date: Wed, 26 Apr 2023 02:17:13 GMT
- Title: Restarted Nonconvex Accelerated Gradient Descent: No More
Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity
- Authors: Huan Li and Zhouchen Lin
- Abstract summary: We propose two simple accelerated gradient methods, restarted accelerated gradient descent (AGD) and the restarted heavy ball (HB) method.
We establish that our methods find an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-7/4})$ gradient evaluations, improving over the best known complexity by an $O(\log\frac{1}{\epsilon})$ factor.
Our algorithms are simple in the sense that they only consist of Nesterov's classical AGD or Polyak's HB iterations, as well as a restart mechanism.
- Score: 70.65867695317633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies accelerated gradient methods for nonconvex optimization
with Lipschitz continuous gradient and Hessian. We propose two simple
accelerated gradient methods, restarted accelerated gradient descent (AGD) and
restarted heavy ball (HB) method, and establish that our methods achieve an
$\epsilon$-approximate first-order stationary point within $O(\epsilon^{-7/4})$
number of gradient evaluations by elementary proofs. Theoretically, our
complexity does not hide any polylogarithmic factors, and thus it improves over
the best known one by the $O(\log\frac{1}{\epsilon})$ factor. Our algorithms
are simple in the sense that they only consist of Nesterov's classical AGD or
Polyak's HB iterations, as well as a restart mechanism. They do not invoke
negative curvature exploitation or minimization of regularized surrogate
functions as the subroutines. In contrast with existing analysis, our
elementary proofs use less advanced techniques and do not invoke the analysis
of strongly convex AGD or HB.
Code is available at https://github.com/lihuanML/RestartAGD.
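As a rough illustration of the method described above, here is a minimal Python sketch of restarted AGD: plain Nesterov iterations inside an epoch, with a restart once the iterates have moved too far. The step size eta, momentum parameter theta, and movement budget B below are illustrative assumptions, not the paper's exact parameter choices; see the linked repository for the authors' implementation.

    import numpy as np

    def restarted_agd(grad_f, x0, eta=1e-3, theta=0.1, B=1.0, max_iters=10000):
        # Sketch of Nesterov's AGD plus a restart mechanism (illustrative parameters).
        x_prev = np.asarray(x0, dtype=float)
        x_cur = x_prev.copy()
        movement = 0.0  # accumulated squared movement within the current epoch
        for _ in range(max_iters):
            # Nesterov-style extrapolation followed by a gradient step.
            y = x_cur + (1.0 - theta) * (x_cur - x_prev)
            x_next = y - eta * grad_f(y)
            movement += float(np.linalg.norm(x_next - x_cur) ** 2)
            x_prev, x_cur = x_cur, x_next
            # Restart: kill the momentum and reset the movement counter once the
            # epoch's total squared movement exceeds B**2 (assumed trigger).
            if movement > B ** 2:
                x_prev = x_cur.copy()
                movement = 0.0
        return x_cur

    # Toy usage on f(x) = sum(x**2) + sum(sin(3*x)), whose gradient is 2*x + 3*cos(3*x).
    if __name__ == "__main__":
        grad = lambda x: 2 * x + 3 * np.cos(3 * x)
        x_out = restarted_agd(grad, x0=np.ones(5))
        print("gradient norm at output:", np.linalg.norm(grad(x_out)))

The restarted HB variant has the same shape, with the extrapolation-plus-gradient step replaced by Polyak's heavy-ball update.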
Related papers
- Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior [8.76346911214414]
We show a new high probability generalization bound of order $O\big(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^{T}(\gamma_t/\varepsilon_t)^2\big)$ for Gradient Langevin Dynamics (GLD).
We can also obtain new bounds for certain variants of SGD.
arXiv Detail & Related papers (2022-05-27T07:23:01Z) - ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication
Acceleration! Finally! [6.170898159041277]
We introduce ProxSkip, a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth ($\psi$) function.
Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices.
arXiv Detail & Related papers (2022-02-18T18:56:05Z) - Escape saddle points by a simple gradient-descent based algorithm [6.7298812735467095]
Escaping saddle points is a central research topic in nonconvex optimization.
We propose a simple gradient-based algorithm such that, for a smooth function $f\colon\mathbb{R}^n\to\mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point within $\tilde{O}((\log n)^2/\epsilon^4)$ iterations.
Technically, our main contribution is the idea of implementing a robust Hessian power method using only gradients (a generic sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-11-28T07:32:54Z) - Faster Perturbed Stochastic Gradient Methods for Finding Local Minima [92.99933928528797]
We propose Pullback, a faster perturbed stochastic gradient framework for finding local minima.
We show that Pullback with stochastic gradient estimators such as SARAH/SPIDER and STORM can find $(\epsilon, \epsilon_H)$-approximate local minima within $\tilde{O}(\epsilon^{-3} + \epsilon_H^{-6})$ stochastic gradient evaluations.
The core idea of our framework is a step-size "pullback" scheme to control the average movement of the iterates.
arXiv Detail & Related papers (2021-10-25T07:20:05Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for nonconvex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we study the nonconvex strongly-concave setting of the gradient-free (zeroth-order) minimax optimization problem.
We show that a novel zeroth-order variance-reduced descent algorithm achieves the best known query complexity (a sketch of the basic zeroth-order gradient estimator appears after this list).
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the speedup of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z) - Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond [25.845034951660544]
We develop a variant of accelerated gradient descent that computes an $\epsilon$-approximate minimizer of a smooth quasar-convex function, with complexity that is optimal up to a logarithmic factor.
We show that no first-order method can improve upon ours.
arXiv Detail & Related papers (2019-06-27T22:39:35Z)