Bounding the expected run-time of nonconvex optimization with early
stopping
- URL: http://arxiv.org/abs/2002.08856v4
- Date: Wed, 22 Jul 2020 17:56:23 GMT
- Title: Bounding the expected run-time of nonconvex optimization with early
stopping
- Authors: Thomas Flynn, Kwang Min Yu, Abid Malik, Nicolas D'Imperio, Shinjae Yoo
- Abstract summary: This work examines the convergence of gradient-based optimization algorithms that use early stopping based on a validation function.
We derive conditions that guarantee this stopping rule is well-defined, and provide bounds on the expected number of iterations and gradient evaluations needed to meet this criterion.
- Score: 2.7648976108201815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work examines the convergence of stochastic gradient-based optimization
algorithms that use early stopping based on a validation function. The form of
early stopping we consider is that optimization terminates when the norm of the
gradient of a validation function falls below a threshold. We derive conditions
that guarantee this stopping rule is well-defined, and provide bounds on the
expected number of iterations and gradient evaluations needed to meet this
criterion. The guarantee accounts for the distance between the training and
validation sets, measured with the Wasserstein distance. We develop the
approach in the general setting of a first-order optimization algorithm, with
possibly biased update directions subject to a geometric drift condition. We
then derive bounds on the expected running time for early stopping variants of
several algorithms, including stochastic gradient descent (SGD), decentralized
SGD (DSGD), and the stochastic variance reduced gradient (SVRG) algorithm.
Finally, we consider the generalization properties of the iterate returned by
early stopping.
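A minimal sketch in Python of the stopping rule analyzed here: run a stochastic gradient method on the training objective and terminate once the norm of the gradient of the validation function falls below a threshold. The helper names (`train_grad_estimate`, `validation_grad`), the fixed step size, and the checking schedule are illustrative assumptions rather than the paper's exact algorithm or guarantees.

```python
import numpy as np

def sgd_with_early_stopping(x0, train_grad_estimate, validation_grad,
                            step_size=0.01, epsilon=1e-3, check_every=10,
                            max_iters=100_000, rng=None):
    """Run SGD and stop when ||grad of the validation function|| < epsilon.

    train_grad_estimate(x, rng): stochastic gradient of the training objective.
    validation_grad(x): gradient of the validation function.
    Both are user-supplied placeholders for this sketch.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for t in range(1, max_iters + 1):
        # Standard SGD step on the training objective.
        x = x - step_size * train_grad_estimate(x, rng)
        # Early-stopping check: terminate once the validation gradient is small.
        if t % check_every == 0 and np.linalg.norm(validation_grad(x)) < epsilon:
            return x, t  # iterate returned by early stopping, iterations used
    return x, max_iters      # threshold not reached within the budget
```

The quantity the paper bounds is the expected value of the iteration (and gradient-evaluation) count needed for this criterion to fire.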
Related papers
- The Reinforce Policy Gradient Algorithm Revisited [7.894349646617293]
We revisit the Reinforce policy gradient algorithm from the literature.
We propose a major enhancement to the basic algorithm.
We provide a proof of convergence for this new algorithm.
arXiv Detail & Related papers (2023-10-08T04:05:13Z) - Variance reduction techniques for stochastic proximal point algorithms [5.374800961359305]
In this work, we propose the first unified study of variance reduction techniques for proximal point algorithms.
We introduce a generic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants.
Our experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts.
arXiv Detail & Related papers (2023-08-18T05:11:50Z) - Parameter-free projected gradient descent [0.0]
We consider the problem of minimizing a convex function over a closed convex set with Projected Gradient Descent (PGD).
We propose a fully parameter-free version of AdaGrad, which is adaptive to the distance between the initialization and the optimum, and to the sum of the squared norms of the subgradients.
Our algorithm handles projection steps and does not involve restarts, reweighing along the trajectory, or additional evaluations compared to classical PGD.
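For context, here is a sketch of the classical (non-parameter-free) projected AdaGrad-norm step that this line of work builds on: scale a single step size by the inverse square root of the accumulated squared subgradient norms, then project back onto the feasible set. The scale parameter `D` below stands in for the unknown distance to the optimum; removing the need to tune such a parameter is precisely what the parameter-free method summarized above achieves, so this is background rather than that algorithm.

```python
import numpy as np

def projected_adagrad_norm(x0, subgradient, project, D=1.0, num_iters=1000):
    """Classical AdaGrad-norm with projection (scalar adaptive step size).

    x_{t+1} = project(x_t - D / sqrt(sum_{s<=t} ||g_s||^2) * g_t)
    subgradient(x) and project(x) are user-supplied; D is illustrative.
    """
    x = np.asarray(x0, dtype=float)
    sum_sq = 0.0
    for _ in range(num_iters):
        g = subgradient(x)
        sum_sq += float(np.dot(g, g))
        if sum_sq == 0.0:
            break  # zero subgradient: already at a minimizer
        x = project(x - (D / np.sqrt(sum_sq)) * g)
    return x
```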
arXiv Detail & Related papers (2023-05-31T07:22:44Z) - Random-reshuffled SARAH does not need a full gradient computations [61.85897464405715]
The StochAstic Recursive grAdient algoRitHm (SARAH) is a variance-reduced variant of the Stochastic Gradient Descent (SGD) algorithm.
In this paper, we remove the necessity of a full gradient computation.
The aggregated gradients serve as an estimate of a full gradient in the SARAH algorithm.
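A sketch of one SARAH-style epoch, assuming component gradients `grad_i(x, i)` and an anchor estimate `v0` of the full gradient. In classical SARAH the anchor is an exact full gradient; per the summary above, the random-reshuffled variant instead reuses gradients aggregated during the previous epoch as the anchor (this sketch samples with replacement for simplicity).

```python
import numpy as np

def sarah_epoch(x, grad_i, n, v0, step_size=0.05, inner_steps=None, rng=None):
    """One epoch of the SARAH recursive gradient estimator (illustrative)."""
    rng = rng or np.random.default_rng(0)
    inner_steps = inner_steps or n
    v = np.array(v0, dtype=float)     # anchor: estimate of the full gradient
    agg = np.zeros_like(v)            # aggregate of fresh gradients for the next anchor
    for _ in range(inner_steps):
        x_new = x - step_size * v     # step with the current estimator
        i = int(rng.integers(n))      # sample one component function
        g_new = grad_i(x_new, i)
        # SARAH recursion: v_t = grad_i(x_t) - grad_i(x_{t-1}) + v_{t-1}
        v = g_new - grad_i(x, i) + v
        agg += g_new / inner_steps
        x = x_new
    return x, agg                     # agg can serve as the next epoch's anchor
```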
arXiv Detail & Related papers (2021-11-26T06:00:44Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points with high probability at the best-known iteration complexity for smooth losses.
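A hedged sketch of one way to combine these three ingredients (clip the stochastic gradient, average it with momentum, take a normalized step); the exact clipping rule, step sizes, and constants in the paper may differ.

```python
import numpy as np

def clipped_normalized_momentum_sgd(x0, stoch_grad, step_size=0.01, clip=1.0,
                                    beta=0.9, num_iters=1000, rng=None):
    """Illustrative combination of clipping, momentum, and normalized steps."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(num_iters):
        g = stoch_grad(x, rng)
        # Clip the stochastic gradient to tame heavy-tailed noise.
        g_norm = np.linalg.norm(g)
        if g_norm > clip:
            g = g * (clip / g_norm)
        # Momentum: exponential average of clipped gradients.
        m = beta * m + (1.0 - beta) * g
        # Normalized descent step: use only the direction of the momentum.
        m_norm = np.linalg.norm(m)
        if m_norm > 0:
            x = x - step_size * m / m_norm
    return x
```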
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable recursive averaging of past stochastic gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
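A sketch of a recursive one-over-t averaging estimator in the spirit of ROOT-SGD, assuming a sampler `draw_sample(rng)` and a per-sample gradient `stoch_grad(x, xi)`; the exact weighting used in the paper may differ from this illustration.

```python
import numpy as np

def root_sgd_like(x0, draw_sample, stoch_grad, step_size=0.05,
                  num_iters=1000, rng=None):
    """Recursive averaging of past stochastic gradients (illustrative)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    v = np.zeros_like(x)
    for t in range(1, num_iters + 1):
        xi = draw_sample(rng)                  # one fresh sample per iteration
        # Recursive 1/t update: reuse the same sample at both iterates.
        v = stoch_grad(x, xi) + (1.0 - 1.0 / t) * (v - stoch_grad(x_prev, xi))
        x_prev, x = x, x - step_size * v       # SGD step with the estimator v
    return x
```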
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - Sequential Quadratic Optimization for Nonlinear Equality Constrained
Stochastic Optimization [10.017195276758454]
It is assumed in this setting that it is intractable to compute objective function and derivative values explicitly.
An algorithm is proposed for the deterministic setting that is modeled after a state-of-the-art line-search SQP algorithm.
The results of numerical experiments demonstrate the practical performance of our proposed techniques.
arXiv Detail & Related papers (2020-07-20T23:04:26Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate, on two popular deep learning tasks, the empirical advantages of our methods over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - Convergence of adaptive algorithms for weakly convex constrained
optimization [59.36386973876765]
We prove the $\tilde{\mathcal{O}}(t^{-1/4})$ rate of convergence for the norm of the gradient of the Moreau envelope.
Our analysis works with mini-batch size of $1$, constant first- and second-order moment parameters, and possibly unbounded optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z) - Non-asymptotic bounds for stochastic optimization with biased noisy
gradient oracles [8.655294504286635]
We introduce biased gradient oracles to capture a setting where the function measurements have an estimation error.
Our proposed oracles arise in practical contexts, for instance, in risk measure estimation from a batch of independent and identically distributed simulation samples.
arXiv Detail & Related papers (2020-02-26T12:53:04Z)