Frank Wolfe Meets Metric Entropy
- URL: http://arxiv.org/abs/2205.08634v1
- Date: Tue, 17 May 2022 21:23:36 GMT
- Title: Frank Wolfe Meets Metric Entropy
- Authors: Suhas Vijaykumar
- Abstract summary: We provide a technique for establishing domain-specific and easy-to-estimate lower bounds for Frank-Wolfe.
We show that a dimension-free linear upper bound must fail not only in the worst case, but in the average case.
We also establish this phenomenon for the nuclear norm ball.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Frank-Wolfe algorithm has seen a resurgence in popularity due to its
ability to efficiently solve constrained optimization problems in machine
learning and high-dimensional statistics. As such, there is much interest in
establishing when the algorithm may possess a "linear" $O(\log(1/\epsilon))$
dimension-free iteration complexity comparable to projected gradient descent.
In this paper, we provide a general technique for establishing domain-specific
and easy-to-estimate lower bounds for Frank-Wolfe and its variants
using the metric entropy of the domain. Most notably, we show that a
dimension-free linear upper bound must fail not only in the worst case, but in
the \emph{average case}: for a Gaussian or spherical random polytope in
$\mathbb{R}^d$ with $\mathrm{poly}(d)$ vertices, Frank-Wolfe requires up to
$\tilde\Omega(d)$ iterations to achieve an $O(1/d)$ error bound, with high
probability. We also establish this phenomenon for the nuclear norm ball.
The link with metric entropy also has interesting positive implications for
conditional gradient algorithms in statistics, such as gradient boosting and
matching pursuit. In particular, we show that it is possible to extract
fast-decaying upper bounds on the excess risk directly from an analysis of the
underlying optimization procedure.
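
To make the lower-bound setting concrete, here is a minimal sketch of vanilla Frank-Wolfe over a Gaussian random polytope, the regime in which the abstract's $\tilde\Omega(d)$ iteration bound applies. The quadratic objective, the classical $2/(t+2)$ step size, and the problem sizes are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
n_vertices = d ** 2                          # poly(d) vertices (illustrative choice)
V = rng.standard_normal((n_vertices, d))     # Gaussian random polytope: conv(rows of V)

# Illustrative smooth objective: squared distance to a point near the origin.
y = 0.1 * rng.standard_normal(d)
grad = lambda x: x - y

x = V[0].copy()                              # initialize at a vertex of the polytope
for t in range(1, 2001):
    g = grad(x)
    v = V[np.argmin(V @ g)]                  # linear minimization oracle over the vertex set
    gap = g @ (x - v)                        # Frank-Wolfe duality gap (upper bounds the primal gap)
    x += (2.0 / (t + 2)) * (v - x)           # classical step size; the iterate stays in the hull

print(f"duality gap after {t} iterations: {gap:.3e}")
```

Tracking the duality gap against $1/d$ as $d$ grows gives an empirical view of the slow-convergence phenomenon described above; nothing here reproduces the paper's analysis.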
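
The second paragraph relates conditional gradient methods to greedy procedures from statistics such as matching pursuit. As a rough illustration of that correspondence, the sketch below runs plain matching pursuit with a hypothetical random dictionary: the atom most correlated with the residual plays the role of the Frank-Wolfe vertex returned by a linear minimization oracle over the symmetrized dictionary hull. All names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 1000                         # hypothetical sample size and dictionary size
D = rng.standard_normal((n, p))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms (columns of D)
y = D[:, :5] @ rng.standard_normal(5)    # target signal built from a few atoms

coef = np.zeros(p)
r = y.copy()                             # current residual
for _ in range(50):
    scores = D.T @ r                     # correlation of every atom with the residual
    j = int(np.argmax(np.abs(scores)))   # greedy selection: the analogue of the FW vertex
    coef[j] += scores[j]                 # plain matching pursuit coefficient update
    r -= scores[j] * D[:, j]             # remove the explained component from the residual

print("residual norm:", np.linalg.norm(r))
```

Because each greedy step is a linear-minimization step, convergence rates for the optimization procedure can be translated into bounds on how fast the residual, and hence the excess risk, decays, which is the kind of argument the abstract alludes to.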
Related papers
- High Probability Convergence of Stochastic Gradient Methods [15.829413808059124]
We show convergence with bounds depending on the initial distance to the optimal solution.
We demonstrate that our techniques can be used to obtain high probability bounds for AdaGrad-Norm.
arXiv Detail & Related papers (2023-02-28T18:42:11Z)
- Pseudonorm Approachability and Applications to Regret Minimization [73.54127663296906]
We convert high-dimensional $\ell_\infty$-approachability problems to low-dimensional pseudonorm approachability problems.
We develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for $\ell_2$ and other norms.
arXiv Detail & Related papers (2023-02-03T03:19:14Z)
- ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z)
- Parameter-free Regret in High Probability with Heavy Tails [45.11667443216861]
We present new algorithms for online convex optimization over unbounded domains.
We obtain parameter-free regret in high probability, given access only to potentially heavy-tailed subgradient estimates.
arXiv Detail & Related papers (2022-10-25T21:43:44Z)
- An Optimal Stochastic Algorithm for Decentralized Nonconvex Finite-sum Optimization [25.21457349137344]
We give a proof showing that DEAREST requires at most $\mathcal{O}(mn + \sqrt{mn}\,L\varepsilon^{-2})$ incremental first-order oracle (IFO) calls and $\mathcal{O}(L\varepsilon^{-2}/\sqrt{1-\lambda_W})$ communication rounds.
arXiv Detail & Related papers (2022-10-25T11:37:11Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Fast Stochastic Composite Minimization and an Accelerated Frank-Wolfe Algorithm under Parallelization [9.166498349629157]
We consider the problem of minimizing the sum of two convex functions.
One has Lipschitz-continuous gradients and can be accessed via oracles, whereas the other is "simple".
We show that one can achieve an $\epsilon$ primal-dual gap (in expectation) in $\tilde{O}(1/\sqrt{\epsilon})$ iterations.
arXiv Detail & Related papers (2022-05-25T13:01:09Z)
- Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z)
- Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$ Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z)
- Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that, for the loss functions considered, HSDMPG can attain a generalization error of order $\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)
- Boosting Frank-Wolfe by Chasing Gradients [26.042029798821375]
We propose to speed up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient via a subroutine; a simplified sketch of this idea appears after this entry.
We demonstrate its competitive advantage both per iteration and in CPU time over the state-of-the-art in a series of computational experiments.
arXiv Detail & Related papers (2020-03-13T16:29:02Z)
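
The sketch below illustrates, in simplified form, the gradient-alignment idea from "Boosting Frank-Wolfe by Chasing Gradients": instead of moving toward a single vertex, the descent direction is built greedily from several vertex directions so that it points more nearly along the negative gradient. This is not the authors' exact procedure; the objective, polytope, number of inner rounds, and step-size rule are all assumptions made for illustration.

```python
import numpy as np

def aligned_direction(g, V, x, inner_rounds=5):
    """Greedily combine vertex directions (v - x) so the result chases -g.

    The returned direction d is a convex combination of (v - x) terms, so
    x + gamma * d stays in conv(V) for any gamma in [0, 1] (given x in conv(V)).
    """
    u = np.zeros_like(x)
    lam_sum = 0.0
    for _ in range(inner_rounds):
        r = -g - u                                   # part of -g not yet captured by u
        v = V[np.argmax((V - x) @ r)]                # vertex direction most aligned with the residual
        step = ((v - x) @ r) / ((v - x) @ (v - x))   # least-squares step onto that direction
        if step <= 0.0:
            break
        u += step * (v - x)
        lam_sum += step
    if lam_sum == 0.0:                               # fall back to the ordinary Frank-Wolfe vertex
        return V[np.argmin(V @ g)] - x
    return u / lam_sum

# Tiny illustrative run: quadratic objective over a random polytope.
rng = np.random.default_rng(2)
V = rng.standard_normal((200, 20))                   # polytope vertices (rows)
y = 0.1 * rng.standard_normal(20)                    # f(x) = 0.5 * ||x - y||^2
x = V[0].copy()
for t in range(1, 501):
    d = aligned_direction(x - y, V, x)               # gradient of f at x is (x - y)
    x += (2.0 / (t + 2)) * d                         # convex step, so x remains feasible

print("final objective value:", 0.5 * np.sum((x - y) ** 2))
```

In the paper itself, the inner procedure is stopped by an alignment-based criterion rather than a fixed number of rounds; the sketch above only conveys the overall gradient-chasing structure.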