How to trap a gradient flow
- URL: http://arxiv.org/abs/2001.02968v3
- Date: Wed, 30 Dec 2020 15:36:23 GMT
- Title: How to trap a gradient flow
- Authors: S\'ebastien Bubeck and Dan Mikulincer
- Abstract summary: We consider the problem of finding an $varepsilon$-approximate stationary point of a smooth function on a compact domain of $mathbbRd$.
Our main contribution is an algorithm, which we call gradient flow trapping (GFT), and the analysis of its oracle complexity.
- Score: 3.198144010381572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of finding an $\varepsilon$-approximate stationary
point of a smooth function on a compact domain of $\mathbb{R}^d$. In contrast
with dimension-free approaches such as gradient descent, we focus here on the
case where $d$ is finite, and potentially small. This viewpoint was explored in
1993 by Vavasis, who proposed an algorithm which, for any fixed finite
dimension $d$, improves upon the $O(1/\varepsilon^2)$ oracle complexity of
gradient descent. For example for $d=2$, Vavasis' approach obtains the
complexity $O(1/\varepsilon)$. Moreover for $d=2$ he also proved a lower bound
of $\Omega(1/\sqrt{\varepsilon})$ for deterministic algorithms (we extend this
result to randomized algorithms).
Our main contribution is an algorithm, which we call gradient flow trapping
(GFT), and the analysis of its oracle complexity. In dimension $d=2$, GFT
closes the gap with Vavasis' lower bound (up to a logarithmic factor), as we
show that it has complexity
$O\left(\sqrt{\frac{\log(1/\varepsilon)}{\varepsilon}}\right)$. In dimension
$d=3$, we show a complexity of
$O\left(\frac{\log(1/\varepsilon)}{\varepsilon}\right)$, improving upon
Vavasis' $O\left(1 / \varepsilon^{1.2} \right)$. In higher dimensions, GFT has
the remarkable property of being a logarithmic parallel depth strategy, in
stark contrast with the polynomial depth of gradient descent or Vavasis'
algorithm. In this higher dimensional regime, the total work of GFT improves
quadratically upon the only other known polylogarithmic depth strategy for this
problem, namely naive grid search. We augment this result with another
algorithm, named \emph{cut and flow} (CF), which improves upon Vavasis'
algorithm in any fixed dimension.
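For context, the following is a minimal sketch of the dimension-free baseline the abstract contrasts GFT with: fixed-step gradient descent on an $L$-smooth function, which needs on the order of $1/\varepsilon^2$ gradient queries to reach an $\varepsilon$-approximate stationary point. The function names, the toy objective, and the unconstrained setting are illustrative assumptions; this is not the paper's GFT algorithm.

```python
import numpy as np

def gd_stationary_point(grad, x0, L, eps):
    """Dimension-free baseline referenced in the abstract: gradient descent
    with step size 1/L on an L-smooth function reaches ||grad f(x)|| <= eps
    within O(1/eps^2) gradient queries (constants and f(x0) - min f omitted).
    All names here are illustrative, not the paper's GFT interface."""
    x = np.asarray(x0, dtype=float)
    budget = int(np.ceil(1.0 / eps ** 2))  # O(1/eps^2) query budget
    for _ in range(budget):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            return x  # eps-approximate stationary point found
        x = x - g / L  # standard 1/L step for an L-smooth objective
    return x

# Toy usage: a smooth function on (a neighbourhood of) a compact domain of R^2.
grad_f = lambda x: np.array([2.0 * x[0], x[1]])   # gradient of x1^2 + x2^2 / 2
x_hat = gd_stationary_point(grad_f, x0=[1.0, -1.0], L=2.0, eps=1e-3)
```

By comparison, the abstract states that GFT needs only $O\big(\sqrt{\log(1/\varepsilon)/\varepsilon}\big)$ queries for $d=2$ and has logarithmic parallel depth in higher dimensions.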
Related papers
- Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Memory-Constrained Algorithms for Convex Optimization via Recursive
Cutting-Planes [23.94542304111204]
This is the first class of algorithms that provides a positive trade-off between gradient descent and cutting-plane methods in any regime with $\epsilon \leq 1/\sqrt{d}$.
In the regime $\epsilon \leq d^{-\Omega(d)}$, our algorithm with $p=d$ achieves the information-theoretic optimal memory usage and improves the oracle complexity of gradient descent.
arXiv Detail & Related papers (2023-06-16T17:00:51Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density; a generic sketch of this smoothed-gradient object appears after this list.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Lifted Primal-Dual Method for Bilinearly Coupled Smooth Minimax
Optimization [47.27237492375659]
We study the bilinearly coupled minimax problem: $\min_x \max_y f(x) + y^\top A x - h(y)$, where $f$ and $h$ are both strongly convex smooth functions.
No known first-order algorithms have hitherto achieved the lower complexity bound of $\Omega\left(\left(\sqrt{\frac{L_x}{\mu_x}} + \frac{\|A\|}{\sqrt{\mu_x\,\mu_y}}\right)\log\left(\frac{1}{\varepsilon}\right)\right)$.
arXiv Detail & Related papers (2022-01-19T05:56:19Z) - Escape saddle points by a simple gradient-descent based algorithm [6.7298812735467095]
Escaping saddle points is a central research topic in nonconvex optimization.
We propose a simple gradient-based algorithm such that, for a smooth function $f\colon\mathbb{R}^n\to\mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations.
Technically, our main contribution is an idea of implementing a robust Hessian power method using only gradients; a generic gradient-only sketch of this idea appears after this list.
arXiv Detail & Related papers (2021-11-28T07:32:54Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an optimization error of $\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error of a learning model.
For generic loss functions, we prove that HSDMPG attains the same $\mathcal{O}\big(1/\sqrt{n}\big)$ order of excess error.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that, for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
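For the ReSQue entry above (as noted in that item), here is a hedged illustration of the object it estimates: the gradient of a Gaussian-smoothed function satisfies the classical identity $\nabla (f * \gamma_\sigma)(x) = \mathbb{E}_{Z \sim \mathcal{N}(0,\sigma^2 I_d)}\big[f(x+Z)\,Z\big]/\sigma^2$, which a plain Monte Carlo estimator can exploit. This is a generic textbook estimator under that identity, with hypothetical names; it is not the paper's reweighted-query construction.

```python
import numpy as np

def smoothed_gradient_mc(f, x, sigma, num_samples=20000, seed=0):
    """Monte Carlo estimate of grad (f * gaussian_sigma)(x) via the identity
    E[f(x + Z) Z] / sigma^2 with Z ~ N(0, sigma^2 I).  Generic estimator only,
    not the ReSQue reweighted-query scheme from the referenced paper."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    z = sigma * rng.standard_normal((num_samples, x.size))  # Gaussian perturbations
    values = np.apply_along_axis(f, 1, x + z)               # f(x + Z) per sample
    return (values[:, None] * z).mean(axis=0) / sigma ** 2

# Toy usage: f(x) = ||x||^2, whose Gaussian-smoothed gradient at x is exactly 2x.
f = lambda v: float(v @ v)
g_hat = smoothed_gradient_mc(f, x=[1.0, -2.0], sigma=0.5)  # approx. [2.0, -4.0]
```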
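For the saddle-point entry above (as noted in that item), the phrase "Hessian power method using only gradients" can be illustrated with the finite-difference identity $\nabla^2 f(x)\,v \approx \big(\nabla f(x + r v) - \nabla f(x)\big)/r$, which lets a power iteration expose a negative-curvature (escape) direction from gradient queries alone. This is a generic sketch under that identity, with hypothetical names and a guessed curvature bound; it is not the referenced paper's algorithm, which adds robustness to gradient noise that this sketch omits.

```python
import numpy as np

def hvp_from_gradients(grad, x, v, r=1e-5):
    """Hessian-vector product from two gradient calls:
    H(x) v ~= (grad f(x + r v) - grad f(x)) / r  (illustrative helper)."""
    return (grad(x + r * v) - grad(x)) / r

def negative_curvature_direction(grad, x, dim, curvature_bound=10.0, iters=100, seed=0):
    """Gradient-only power iteration on (c*I - H), which amplifies the most
    negative eigenvalue of the Hessian H at x.  `curvature_bound` (c) is an
    assumed upper bound on ||H(x)||, a hypothetical constant for this sketch."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = curvature_bound * v - hvp_from_gradients(grad, x, v)  # (cI - H) v
        v = w / np.linalg.norm(w)
    curvature = v @ hvp_from_gradients(grad, x, v)  # Rayleigh quotient along v
    return v, curvature

# Toy usage at the strict saddle x = 0 of f(x) = x1^2 - x2^2.
grad_f = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])
v, lam = negative_curvature_direction(grad_f, x=np.zeros(2), dim=2)
# lam is close to -2 and v close to +/- e2, the direction along which to escape.
```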
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.