STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization
- URL: http://arxiv.org/abs/2111.01040v1
- Date: Mon, 1 Nov 2021 15:43:36 GMT
- Title: STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization
- Authors: Kfir Y. Levy, Ali Kavis, Volkan Cevher
- Abstract summary: We investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
- Score: 74.1615979057429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we investigate stochastic non-convex optimization problems where
the objective is an expectation over smooth loss functions, and the goal is to
find an approximate stationary point. The most popular approach to handling
such problems is variance reduction techniques, which are also known to obtain
tight convergence rates, matching the lower bounds in this case. Nevertheless,
these techniques require a careful maintenance of anchor points in conjunction
with appropriately selected "mega-batchsizes". This leads to a challenging
hyperparameter tuning problem that weakens their practicality. Recently,
[Cutkosky and Orabona, 2019] have shown that one can employ recursive momentum
in order to avoid the use of anchor points and large batchsizes, and still
obtain the optimal rate for this setting. Yet, their method called STORM
crucially relies on the knowledge of the smoothness, as well as a bound on the
gradient norms. In this work we propose STORM+, a new method that is completely
parameter-free, does not require large batch-sizes, and obtains the optimal
$O(1/T^{1/3})$ rate for finding an approximate stationary point. Our work
builds on the STORM algorithm, in conjunction with a novel approach to
adaptively set the learning rate and momentum parameters.
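The following is a minimal sketch of the recursive-momentum (STORM-style) loop that STORM+ builds on, assuming a user-supplied stochastic gradient oracle. The names `stochastic_grad`, `c`, and `lr_scale`, and the particular adaptive rules for the step size and momentum below are illustrative assumptions, not the exact STORM+ formulas from the paper.

```python
import numpy as np

def storm_plus_sketch(x0, stochastic_grad, T, c=1.0, lr_scale=0.1, rng=None):
    """Sketch of a STORM-style recursive-momentum loop with adaptive parameters.

    `stochastic_grad(x, sample)` is an assumed oracle returning an unbiased
    gradient estimate at x for a given sample. The adaptive choices of the
    step size and momentum parameter only illustrate the idea of setting them
    from observed gradient magnitudes; they are not the paper's exact rules.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)

    sample = rng.integers(10**9)          # placeholder for drawing a fresh sample xi_1
    d = stochastic_grad(x, sample)        # d_1: initial gradient estimate
    grad_norm_sq_sum = float(np.dot(d, d))
    est_norm_sq_sum = float(np.dot(d, d))

    for t in range(1, T):
        # Adaptive step size: shrinks as the accumulated gradient energy grows.
        eta = lr_scale / (grad_norm_sq_sum ** (1.0 / 3.0) + 1e-12)
        x_prev, x = x, x - eta * d

        # The same fresh sample is used at both the new and the previous iterate.
        sample = rng.integers(10**9)
        g_new = stochastic_grad(x, sample)
        g_prev = stochastic_grad(x_prev, sample)

        # Adaptive momentum parameter a_t in (0, 1], tied here to estimator energy.
        a = min(1.0, c / (1.0 + est_norm_sq_sum) ** (2.0 / 3.0))

        # Recursive momentum estimate: d_t = g_t + (1 - a_t) * (d_{t-1} - g_{t-1}).
        d = g_new + (1.0 - a) * (d - g_prev)

        grad_norm_sq_sum += float(np.dot(g_new, g_new))
        est_norm_sq_sum += float(np.dot(d, d))

    return x
```

The defining feature of the recursion is that each fresh sample is used to evaluate the gradient at both the current and the previous iterate, which is what provides variance reduction without anchor points or mega-batches; STORM+ additionally removes the need to know the smoothness or a gradient-norm bound by setting the step size and momentum adaptively.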
Related papers
- Accelerated Parameter-Free Stochastic Optimization [28.705054104155973]
We propose a method that achieves near-optimal rates for smooth convex optimization.
It requires essentially no prior knowledge of problem parameters.
Our experiments show consistent, strong performance on convex problems and mixed results on neural network training.
arXiv Detail & Related papers (2024-03-31T12:21:57Z)
- Extra-Newton: A First Approach to Noise-Adaptive Accelerated Second-Order Methods [57.050204432302195]
This work proposes a universal and adaptive second-order method for minimizing second-order smooth, convex functions.
Our algorithm achieves $O(\sigma / \sqrt{T})$ convergence when the oracle feedback is stochastic with variance $\sigma^2$, and improves its convergence to $O(1 / T^3)$ with deterministic oracles.
arXiv Detail & Related papers (2022-11-03T14:12:51Z)
- META-STORM: Generalized Fully-Adaptive Variance Reduced SGD for Unbounded Functions [23.746620619512573]
Recent work avoids the need to compute gradients over "mega-batches".
The resulting updates are competitive on common deep learning tasks.
arXiv Detail & Related papers (2022-09-29T15:12:54Z)
- A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum-assisted Single-Timescale Stochastic Approximation (MSTSA) -- for tackling bilevel optimization problems.
MSTSA allows us to control the error in its iterates caused by inexact solutions to the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, but only up to an error neighborhood.
Rather than fixing the mini-batch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Convergence of adaptive algorithms for weakly convex constrained optimization [59.36386973876765]
We prove the $\tilde{\mathcal{O}}(t^{-1/4})$ rate of convergence for the norm of the gradient of the Moreau envelope.
Our analysis works with mini-batch size of $1$, constant first and second order moment parameters, and possibly unbounded optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z)
- Support recovery and sup-norm convergence rates for sparse pivotal estimation [79.13844065776928]
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level.
We show minimax sup-norm convergence rates for non-smoothed and smoothed, single-task and multi-task square-root Lasso-type estimators.
arXiv Detail & Related papers (2020-01-15T16:11:04Z)