Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence
and Variance Reduction
- URL: http://arxiv.org/abs/2308.06058v2
- Date: Mon, 21 Aug 2023 16:28:13 GMT
- Title: Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence
and Variance Reduction
- Authors: Xiaowen Jiang and Sebastian U. Stich
- Abstract summary: We propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which guarantee convergence in non-interpolation settings.
We equip AdaSPS and AdaSLS with a novel variance reduction technique and obtain algorithms that require $\smash{\widetilde{\mathcal{O}}}(n+1/\epsilon)$ gradient evaluations.
- Score: 26.9632099249085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed stochastic Polyak stepsize (SPS) and stochastic
line-search (SLS) for SGD have shown remarkable effectiveness when training
over-parameterized models. However, in non-interpolation settings, both
algorithms only guarantee convergence to a neighborhood of a solution which may
result in a worse output than the initial guess. While artificially decreasing
the adaptive stepsize has been proposed to address this issue (Orvieto et al.
[2022]), this approach results in slower convergence rates for convex and
over-parameterized models. In this work, we make two contributions: Firstly, we
propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which
guarantee convergence in non-interpolation settings and maintain sub-linear and
linear convergence rates for convex and strongly convex functions when training
over-parameterized models. AdaSLS requires no knowledge of problem-dependent
parameters, and AdaSPS requires only a lower bound of the optimal function
value as input. Secondly, we equip AdaSPS and AdaSLS with a novel variance
reduction technique and obtain algorithms that require
$\smash{\widetilde{\mathcal{O}}}(n+1/\epsilon)$ gradient evaluations to achieve
an $\mathcal{O}(\epsilon)$-suboptimality for convex functions, which improves
upon the slower $\mathcal{O}(1/\epsilon^2)$ rates of AdaSPS and AdaSLS without
variance reduction in the non-interpolation regimes. Moreover, our result
matches the fast rates of AdaSVRG but removes the inner-outer-loop structure,
which is easier to implement and analyze. Finally, numerical experiments on
synthetic and real datasets validate our theory and demonstrate the
effectiveness and robustness of our algorithms.
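Below is a minimal sketch of SGD with a stochastic Polyak-type stepsize, the rule that AdaSPS builds on: the stepsize is driven by the current per-sample function-value gap over the squared gradient norm, using only a lower bound on the optimal value. The interface (`sps_sgd`, `loss_fn`, `grad_fn`) and the constants `c` and `lr_max` are illustrative assumptions, not taken from the paper; the paper's AdaSPS additionally accumulates past gaps to damp the stepsize outside the interpolation regime, which is not reproduced here.
```python
import numpy as np

def sps_sgd(loss_fn, grad_fn, x0, data, f_lower=0.0, c=0.5, lr_max=1.0,
            epochs=10, seed=0):
    """SGD with a stochastic Polyak-type stepsize (illustrative sketch).

    Stepsize: gamma_t = min( (f_i(x_t) - f_lower) / (c * ||g_i(x_t)||^2), lr_max ),
    where f_lower is a lower bound on the per-sample optimal value
    (e.g. 0 for non-negative losses), as assumed by SPS/AdaSPS-style methods.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = len(data)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = grad_fn(x, data[i])
            gap = max(loss_fn(x, data[i]) - f_lower, 0.0)   # function-value gap
            gamma = min(gap / (c * float(g @ g) + 1e-12), lr_max)
            x -= gamma * g
    return x

# Toy usage: least squares on synthetic data (losses are non-negative, so f_lower=0).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 5))
    x_true = rng.normal(size=5)
    b = A @ x_true + 0.01 * rng.normal(size=200)
    data = list(zip(A, b))
    loss = lambda x, d: 0.5 * (d[0] @ x - d[1]) ** 2
    grad = lambda x, d: (d[0] @ x - d[1]) * d[0]
    x_hat = sps_sgd(loss, grad, np.zeros(5), data)
    print(np.linalg.norm(x_hat - x_true))
```
The cap `lr_max` guards against huge steps when the gradient is tiny; per the abstract, AdaSPS needs only the lower bound on the optimal value as input, while AdaSLS (a line-search variant) needs no problem-dependent parameters at all.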
Related papers
- Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes.
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Adaptive Variance Reduction for Stochastic Optimization under Weaker Assumptions [26.543628010637036]
We introduce a novel adaptive variance reduction method that achieves an optimal convergence rate of $\mathcal{O}(\log T)$ for non-convex functions.
We also extend the proposed technique to obtain the same optimal rate of $\mathcal{O}(\log T)$ for compositional optimization.
arXiv Detail & Related papers (2024-06-04T04:39:51Z) - Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient [40.22217106270146]
Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods.
In composite optimization problems, ZO methods encounter an additional variance, the coordinate-wise variance, which stems from random gradient estimation.
This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances.
arXiv Detail & Related papers (2024-05-28T02:27:53Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Stochastic Mirror Descent: Convergence Analysis and Adaptive Variants
via the Mirror Stochastic Polyak Stepsize [20.376216873620763]
We investigate the convergence of stochastic mirror descent (SMD) under interpolation in relatively smooth and smooth convex optimization.
We propose a new adaptive stepsize scheme -- the mirror stochastic Polyak stepsize (mSPS).
arXiv Detail & Related papers (2021-10-28T19:49:40Z) - Faster Algorithm and Sharper Analysis for Constrained Markov Decision
Process [56.55075925645864]
The problem of the constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints.
A new utilities-dual convex approach is proposed, with a novel integration of three ingredients: a regularized policy, a dual regularizer, and Nesterov's gradient descent on the dual.
This is the first demonstration that nonconcave CMDP problems can attain the complexity lower bound of $\mathcal{O}(1/\epsilon)$ for optimization subject to convex constraints.
arXiv Detail & Related papers (2021-10-20T02:57:21Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Reducing the Variance of Gaussian Process Hyperparameter Optimization
with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z) - On Stochastic Moving-Average Estimators for Non-Convex Optimization [105.22760323075008]
In this paper, we demonstrate the power of a widely used stochastic estimator based on moving averages (SEMA) for solving non-convex optimization problems.
We also present state-of-the-art results for these problems.
arXiv Detail & Related papers (2021-04-30T08:50:24Z) - On Stochastic Variance Reduced Gradient Method for Semidefinite
Optimization [14.519696724619074]
The stochastic variance reduced gradient (SVRG) method has been regarded as one of the most effective variance-reduction methods.
However, there is a significant gap between theory and practice when SVRG is adapted to semidefinite programming (SDP).
In this paper, we fill this gap by exploiting a new variant of the original SVRG with Option I adapted to semidefinite optimization.
arXiv Detail & Related papers (2021-01-01T13:55:32Z) - Adaptive extra-gradient methods for min-max optimization and games [35.02879452114223]
We present a new family of min-max optimization algorithms that automatically exploit the geometry of the gradient data observed at earlier iterations.
Thanks to this adaptation mechanism, the proposed method automatically detects whether the problem is smooth or not.
It converges to an $\varepsilon$-optimal solution within $\mathcal{O}(1/\varepsilon)$ iterations in smooth problems, and within $\mathcal{O}(1/\varepsilon^2)$ iterations in non-smooth ones.
arXiv Detail & Related papers (2020-10-22T22:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.