Bounding the expected run-time of nonconvex optimization with early
stopping
- URL: http://arxiv.org/abs/2002.08856v4
- Date: Wed, 22 Jul 2020 17:56:23 GMT
- Title: Bounding the expected run-time of nonconvex optimization with early
stopping
- Authors: Thomas Flynn, Kwang Min Yu, Abid Malik, Nicolas D'Imperio, Shinjae Yoo
- Abstract summary: This work examines the convergence of gradient-based optimization algorithms that use early stopping based on a validation function.
We derive conditions that guarantee this stopping rule is well-defined, and provide bounds on the expected number of iterations and gradient evaluations needed to meet this criterion.
- Score: 2.7648976108201815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work examines the convergence of stochastic gradient-based optimization
algorithms that use early stopping based on a validation function. The form of
early stopping we consider is that optimization terminates when the norm of the
gradient of a validation function falls below a threshold. We derive conditions
that guarantee this stopping rule is well-defined, and provide bounds on the
expected number of iterations and gradient evaluations needed to meet this
criterion. The guarantee accounts for the distance between the training and
validation sets, measured with the Wasserstein distance. We develop the
approach in the general setting of a first-order optimization algorithm, with
possibly biased update directions subject to a geometric drift condition. We
then derive bounds on the expected running time for early stopping variants of
several algorithms, including stochastic gradient descent (SGD), decentralized
SGD (DSGD), and the stochastic variance reduced gradient (SVRG) algorithm.
Finally, we consider the generalization properties of the iterate returned by
early stopping.
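As a rough illustration of the stopping rule described in the abstract, the sketch below runs minibatch SGD on a training loss and terminates once the norm of the validation-loss gradient falls below a threshold. The least-squares setup, step size, and threshold are hypothetical choices for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the paper): least-squares losses with
# separate training and validation samples drawn from the same model.
x_true = rng.normal(size=5)
A_train = rng.normal(size=(200, 5))
b_train = A_train @ x_true + 0.1 * rng.normal(size=200)
A_val = rng.normal(size=(50, 5))
b_val = A_val @ x_true + 0.1 * rng.normal(size=50)

def grad(A, b, x):
    """Gradient of the mean-squared-error loss (1/2m)||Ax - b||^2."""
    return A.T @ (A @ x - b) / len(b)

def sgd_early_stop(x, lr=0.05, eps=0.3, batch=10, max_iter=10_000):
    """Run SGD on the training loss; stop once the norm of the
    validation-loss gradient falls below the threshold eps."""
    for t in range(max_iter):
        if np.linalg.norm(grad(A_val, b_val, x)) < eps:   # stopping rule
            return x, t
        idx = rng.integers(0, len(b_train), size=batch)   # minibatch sample
        x = x - lr * grad(A_train[idx], b_train[idx], x)
    return x, max_iter

x_final, iters = sgd_early_stop(np.zeros(5))
```

Note that the stopping rule evaluates the full validation gradient; the paper's bounds concern how many such iterations and gradient evaluations are needed in expectation.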
Related papers
- The Reinforce Policy Gradient Algorithm Revisited [7.894349646617293]
We revisit the Reinforce policy gradient algorithm from the literature.
We propose a major enhancement to the basic algorithm.
We provide a proof of convergence for this new algorithm.
arXiv Detail & Related papers (2023-10-08T04:05:13Z) - Parameter-free projected gradient descent [0.0]
We consider the problem of minimizing a convex function over a closed convex set with Projected Gradient Descent (PGD).
We propose a fully parameter-free version of AdaGrad, which is adaptive to the distance between the initialization and the optimum, and to the sum of the squared norms of the subgradients.
Our algorithm is able to handle projection steps and does not involve restarts, reweighting along the trajectory, or additional evaluations compared to the classical PGD.
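The classical PGD baseline referenced above (a gradient step followed by projection back onto the feasible set) can be sketched on a toy constrained quadratic; the ball constraint, objective, and step size below are illustrative assumptions:

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def pgd(grad, x0, step, n_steps, project):
    """Classical PGD: take a gradient step, then project onto the set."""
    x = x0
    for _ in range(n_steps):
        x = project(x - step * grad(x))
    return x

# Toy instance: minimize ||x - c||^2 over the unit ball, with c outside it;
# the minimizer is the boundary point c / ||c|| = (1, 0).
c = np.array([2.0, 0.0])
x = pgd(lambda x: 2.0 * (x - c), np.zeros(2),
        step=0.1, n_steps=200, project=project_ball)
```

The parameter-free contribution of the paper concerns choosing `step` adaptively; the fixed step here is only for illustration.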
arXiv Detail & Related papers (2023-05-31T07:22:44Z) - Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates the delayed online convex optimization (OCO) in non-stationary environments.
We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient according to their arrival order.
We develop an improved algorithm, which reduces the dynamic regret bounds achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$.
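The DOGD step described above (one descent step per delayed gradient, applied in arrival order) can be sketched as follows, assuming for illustration a fixed delay and a simple scalar quadratic; all names and parameters here are hypothetical:

```python
def dogd(grad, x0, lr, delay, horizon):
    """DOGD sketch: the gradient queried at round t only arrives at round
    t + delay; on arrival, perform one plain descent step per gradient,
    in arrival order."""
    x = x0
    pending = {}                                  # arrival round -> gradients
    for t in range(horizon):
        pending.setdefault(t + delay, []).append(grad(x))
        for g in pending.pop(t, []):              # gradients arriving now
            x -= lr * g
    return x

# Toy instance: minimize f(x) = x^2 with every gradient delayed by 2 rounds.
x_T = dogd(lambda x: 2.0 * x, x0=1.0, lr=0.05, delay=2, horizon=100)
```

With a small enough step size, the delayed recursion remains stable and the iterate still converges to the minimizer.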
arXiv Detail & Related papers (2023-05-20T07:54:07Z) - On Finite-Step Convergence of the Non-Greedy Algorithm and Proximal
Alternating Minimization Method with Extrapolation for $L_1$-Norm PCA [0.685316573653194]
The non-greedy algorithm (NGA) and the recently proposed proximal alternating minimization method with extrapolation (PAMe) for $L_1$-norm PCA are studied.
It is shown that the iterates generated by the modified algorithm will not change after at most $\left\lceil F^{\max}/\tau \right\rceil$ steps.
For PAMe, it is proved that the sign variable will remain constant after finitely many steps and that the algorithm converges to a point satisfying a certain optimality condition.
arXiv Detail & Related papers (2023-02-15T15:17:02Z) - Random-reshuffled SARAH does not need a full gradient computations [61.85897464405715]
The StochAstic Recursive grAdient algoRitHm (SARAH) is a variance-reduced variant of the Stochastic Gradient Descent (SGD) algorithm.
In this paper, we remove the necessity of a full gradient computation.
The aggregated gradients serve as an estimate of a full gradient in the SARAH algorithm.
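The SARAH recursion that such a gradient estimate anchors can be sketched on a hypothetical finite-sum least-squares problem. This sketch keeps the classical periodic full gradient at the start of each epoch rather than the paper's aggregated estimate; the problem instance and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite-sum least-squares problem (not from the paper):
# f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2, with an interpolating solution.
A = rng.normal(size=(200, 5))
x_star = rng.normal(size=5)
b = A @ x_star

def gi(i, x):
    """Gradient of the i-th component function 0.5 * (a_i . x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def sarah_epoch(x, lr=0.02, inner=200):
    """One SARAH epoch: a full gradient at the anchor point, then
    recursive (biased) stochastic gradient estimates."""
    v = A.T @ (A @ x - b) / len(b)            # full gradient at the anchor
    x_prev, x = x, x - lr * v
    for _ in range(inner):
        i = rng.integers(len(b))
        v = gi(i, x) - gi(i, x_prev) + v      # SARAH recursive estimate
        x_prev, x = x, x - lr * v
    return x

x = np.zeros(5)
for _ in range(10):
    x = sarah_epoch(x)
```

Unlike SVRG, the estimate `v` is biased given the current iterate, but its variance contracts along the epoch, which is what enables removing or cheapening the full gradient.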
arXiv Detail & Related papers (2021-11-26T06:00:44Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under suitable assumptions on the delays.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Sequential Quadratic Optimization for Nonlinear Equality Constrained
Stochastic Optimization [10.017195276758454]
It is assumed in this setting that it is intractable to compute objective function and derivative values explicitly.
An algorithm is proposed for the deterministic setting that is modeled after a state-of-the-art line-search SQP algorithm.
The results of numerical experiments demonstrate the practical performance of our proposed techniques.
arXiv Detail & Related papers (2020-07-20T23:04:26Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
On two popular deep learning tasks, we demonstrate the empirical advantages of our methods over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - Convergence of adaptive algorithms for weakly convex constrained
optimization [59.36386973876765]
We prove the $\tilde{\mathcal{O}}(t^{-1/4})$ rate of convergence for the norm of the gradient of the Moreau envelope.
Our analysis works with mini-batch size of $1$, constant first and second order moment parameters, and possibly smooth optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z) - Non-asymptotic bounds for stochastic optimization with biased noisy
gradient oracles [8.655294504286635]
We introduce biased gradient oracles to capture a setting where the function measurements have an estimation error.
Our proposed oracles arise in practical contexts, for instance, in risk measure estimation from a batch of independent and identically distributed simulation samples.
arXiv Detail & Related papers (2020-02-26T12:53:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.