PAGE: A Simple and Optimal Probabilistic Gradient Estimator for
Nonconvex Optimization
- URL: http://arxiv.org/abs/2008.10898v3
- Date: Fri, 11 Jun 2021 21:37:35 GMT
- Title: PAGE: A Simple and Optimal Probabilistic Gradient Estimator for
Nonconvex Optimization
- Authors: Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter Richt\'arik
- Abstract summary: We propose a novel ProbAbilistic Gradient Estimator (PAGE) for nonconvex optimization.
It is easy to implement as it is designed via a small adjustment to vanilla SGD.
PAGE not only converges much faster than SGD in training but also achieves higher test accuracy.
- Score: 33.346773349614715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel stochastic gradient estimator --
ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is
easy to implement as it is designed via a small adjustment to vanilla SGD: in
each iteration, PAGE uses the vanilla minibatch SGD update with probability
$p_t$ or reuses the previous gradient with a small adjustment, at a much lower
computational cost, with probability $1-p_t$. We give a simple formula for the
optimal choice of $p_t$. Moreover, we prove the first tight lower bound
$\Omega(n+\frac{\sqrt{n}}{\epsilon^2})$ for nonconvex finite-sum problems,
which also leads to a tight lower bound $\Omega(b+\frac{\sqrt{b}}{\epsilon^2})$
for nonconvex online problems, where $b:= \min\{\frac{\sigma^2}{\epsilon^2},
n\}$. Then, we show that PAGE obtains the optimal convergence results
$O(n+\frac{\sqrt{n}}{\epsilon^2})$ (finite-sum) and
$O(b+\frac{\sqrt{b}}{\epsilon^2})$ (online) matching our lower bounds for both
nonconvex finite-sum and online problems. Besides, we also show that for
nonconvex functions satisfying the Polyak-\L{}ojasiewicz (PL) condition, PAGE
can automatically switch to a faster linear convergence rate $O(\cdot\log
\frac{1}{\epsilon})$. Finally, we conduct several deep learning experiments
(e.g., LeNet, VGG, ResNet) on real datasets in PyTorch showing that PAGE not
only converges much faster than SGD in training but also achieves higher
test accuracy, validating the optimal theoretical results and confirming the
practical superiority of PAGE.
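To make the update rule concrete, below is a minimal NumPy sketch of a PAGE-style estimator, following the description above: with probability $p_t$ the estimator is recomputed as a vanilla minibatch gradient, and with probability $1-p_t$ the previous gradient is reused with a cheap correction. The helper `grad_batch`, the smaller correction batch size `b_prime`, and the fixed probability `p` are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def page(grad_batch, x0, n, steps, lr, p, b, b_prime, seed=0):
    """Minimal PAGE-style sketch (assumptions noted in the lead-in).

    grad_batch(x, idx) is a user-supplied (hypothetical) helper returning the
    average gradient of the sampled component functions f_i, i in idx, at x.
    With probability p the estimator is recomputed from a fresh minibatch of
    size b; otherwise the previous estimator is reused with a cheap correction
    computed on a smaller minibatch of size b_prime (the "small adjustment").
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    # initialize the estimator with one minibatch gradient
    g = grad_batch(x, rng.choice(n, size=b, replace=False))
    for _ in range(steps):
        x_new = x - lr * g                      # plain SGD-style step with the estimator
        if rng.random() < p:
            idx = rng.choice(n, size=b, replace=False)
            g = grad_batch(x_new, idx)          # vanilla minibatch SGD gradient
        else:
            idx = rng.choice(n, size=b_prime, replace=False)
            # reuse the previous gradient, corrected by the change measured on a small batch
            g = g + grad_batch(x_new, idx) - grad_batch(x, idx)
        x = x_new
    return x
```

The sketch treats the switching probability `p` as a fixed hyperparameter; the paper gives a simple formula for the optimal choice of $p_t$, which is not reproduced here.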
Related papers
- Nearly Optimal Regret for Decentralized Online Convex Optimization [53.433398074919]
Decentralized online convex optimization (D-OCO) aims to minimize a sequence of global loss functions using only local computations and communications.
We develop novel D-OCO algorithms that can respectively reduce the regret bounds for convex and strongly convex functions.
Our algorithms are nearly optimal in terms of $T$, $n$, and $\rho$.
arXiv Detail & Related papers (2024-02-14T13:44:16Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum [7.095058159492494]
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent.
We show that SHB can attain an accelerated rate of convergence when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$.
arXiv Detail & Related papers (2024-01-12T18:17:28Z) - Breaking the Lower Bound with (Little) Structure: Acceleration in
Non-Convex Stochastic Optimization with Heavy-Tailed Noise [28.780192812703948]
We consider the optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime.
We show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure.
We also establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)\,T^{\frac{1-p}{3p-2}})$ under a mild condition.
arXiv Detail & Related papers (2023-02-14T00:23:42Z) - Near-Optimal Non-Convex Stochastic Optimization under Generalized
Smoothness [21.865728815935665]
Two recent works established the $O(\epsilon^{-3})$ sample complexity to obtain an $O(\epsilon)$-stationary point.
However, both require a large batch size on the order of $\mathrm{poly}(\epsilon^{-1})$, which is not only computationally burdensome but also unsuitable for streaming applications.
In this work, we solve the prior two problems simultaneously by revisiting a simple variant of the STORM algorithm.
arXiv Detail & Related papers (2023-02-13T00:22:28Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - An Optimal Stochastic Algorithm for Decentralized Nonconvex Finite-sum
Optimization [25.21457349137344]
We prove that DEAREST requires at most $\mathcal{O}(mn + \sqrt{mn}\,L\varepsilon^{-2})$ incremental first-order oracle (IFO) calls and $\mathcal{O}(L\varepsilon^{-2}/\sqrt{1-\lambda_W})$ communication rounds.
arXiv Detail & Related papers (2022-10-25T11:37:11Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Minimax Optimal Regression over Sobolev Spaces via Laplacian
Regularization on Neighborhood Graphs [25.597646488273558]
We study the statistical properties of Laplacian smoothing, a graph-based approach to nonparametric regression.
We prove that Laplacian smoothing is manifold-adaptive.
arXiv Detail & Related papers (2021-06-03T01:20:41Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private algorithm is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e., the learner has access to only a zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.