Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast
Convergence
- URL: http://arxiv.org/abs/2002.10542v3
- Date: Mon, 22 Mar 2021 14:53:59 GMT
- Title: Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast
Convergence
- Authors: Nicolas Loizou, Sharan Vaswani, Issam Laradji, Simon Lacoste-Julien
- Abstract summary: We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method.
The proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD).
- Score: 30.393999722555154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a stochastic variant of the classical Polyak step-size (Polyak,
1987) commonly used in the subgradient method. Although computing the Polyak
step-size requires knowledge of the optimal function values, this information
is readily available for typical modern machine learning applications.
Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive
choice for setting the learning rate for stochastic gradient descent (SGD). We
provide theoretical convergence guarantees for SGD equipped with SPS in
different settings, including strongly convex, convex and non-convex functions.
Furthermore, our analysis results in novel convergence guarantees for SGD with
a constant step-size. We show that SPS is particularly effective when training
over-parameterized models capable of interpolating the training data. In this
setting, we prove that SPS enables SGD to converge to the true solution at a
fast rate without requiring the knowledge of any problem-dependent constants or
additional computational overhead. We experimentally validate our theoretical
results via extensive experiments on synthetic and real datasets. We
demonstrate the strong performance of SGD with SPS compared to state-of-the-art
optimization methods when training over-parameterized models.
Related papers
- Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance [10.11126899274029]
We propose and explore new Polyak-type variants suitable for the update rule of the stochastic Heavy Ball method (SHB).
For MomSPS$_{\max}$, we provide convergence guarantees for SHB to a neighborhood of the solution for convex and smooth problems (without assuming interpolation).
The other two variants, MomDecSPS and MomAdaSPS, are the first adaptive step-sizes for SHB that guarantee convergence to the exact minimizer without prior knowledge.
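The precise MomSPS$_{\max}$, MomDecSPS, and MomAdaSPS definitions are given in the cited paper; the hypothetical helper below only sketches the general shape of a stochastic heavy-ball (SHB) update whose learning rate is set by a capped Polyak-type rule, with beta and gamma_max as illustrative constants.

```python
import numpy as np

def shb_polyak_update(x, x_prev, fi, grad_fi, fi_star=0.0, c=0.5,
                      gamma_max=1.0, beta=0.5, eps=1e-12):
    """One stochastic heavy-ball step,
    x_{k+1} = x_k - gamma_k * grad f_i(x_k) + beta * (x_k - x_{k-1}),
    with gamma_k from a capped Polyak-type rule (illustrative, not MomSPS itself)."""
    gamma = min((fi - fi_star) / (c * np.dot(grad_fi, grad_fi) + eps), gamma_max)
    x_next = x - gamma * grad_fi + beta * (x - x_prev)
    return x_next, x  # (new iterate, previous iterate) for the next call
```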
arXiv Detail & Related papers (2024-06-06T15:08:06Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Stochastic Gradient Descent with Preconditioned Polyak Step-size [1.3300175008796402]
Stochastic Gradient Descent with Polyak Step-size (SPS) offers an update rule that alleviates the need to fine-tune the learning rate for each dataset.
In this paper, we propose an extension of SPS that employs preconditioning techniques, such as Hutchinson's method, Adam, and AdaGrad, to improve its performance on badly scaled and/or ill-conditioned datasets.
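As a rough illustration (the exact preconditioners and step-size definition are those of the cited paper, not this sketch), a Polyak-type step with a positive diagonal preconditioner might look like the hypothetical helper below; diag_D could come from, e.g., a Hutchinson-style estimate of the Hessian diagonal.

```python
import numpy as np

def preconditioned_polyak_step(x, fi, grad_fi, diag_D, fi_star=0.0, c=0.5,
                               gamma_max=1.0, eps=1e-12):
    """Move along D^{-1} * grad f_i and measure the gradient norm in the same
    D^{-1} metric, so badly scaled coordinates receive larger effective steps."""
    precond_grad = grad_fi / (diag_D + eps)
    gamma = min((fi - fi_star) / (c * np.dot(grad_fi, precond_grad) + eps), gamma_max)
    return x - gamma * precond_grad
```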
arXiv Detail & Related papers (2023-10-03T14:36:05Z) - Faster Stochastic Variance Reduction Methods for Compositional MiniMax
Optimization [50.10952609321302]
Compositional minimax optimization is a pivotal challenge across various machine learning domains.
Current methods of compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes.
This paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which can achieve the optimal sample complexity of $O(\kappa^3/\epsilon^3)$.
arXiv Detail & Related papers (2023-08-18T14:57:21Z) - Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence
and Variance Reduction [26.9632099249085]
We propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which guarantee convergence in non-interpolation settings.
We equip AdaSPS and AdaSLS with a novel variance reduction technique and obtain algorithms that require $\widetilde{\mathcal{O}}(n + 1/\epsilon)$ gradient evaluations.
arXiv Detail & Related papers (2023-08-11T10:17:29Z) - SketchySGD: Reliable Stochastic Optimization via Randomized Curvature
Estimates [19.420605210427635]
SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian.
We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum.
In the ill-conditioned setting, we show that SketchySGD converges at a faster rate than SGD for least-squares problems.
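The sketch below is not SketchySGD itself (the paper specifies how the low-rank factors are obtained by randomized sketching and when they are refreshed); it only shows, under those assumptions, the closed form for applying the inverse of a rank-r-plus-damping approximation V diag(lam) V^T + rho I of a subsampled Hessian to a stochastic gradient.

```python
import numpy as np

def low_rank_preconditioned_grad(grad, V, lam, rho=1e-3):
    """Apply (V @ diag(lam) @ V.T + rho * I)^{-1} to `grad`, where V has
    orthonormal columns (a rank-r eigen-approximation of a subsampled Hessian)
    and rho > 0 is a damping term."""
    coeffs = V.T @ grad
    return V @ (coeffs / (lam + rho)) + (grad - V @ coeffs) / rho
```

The preconditioned gradient can then be plugged into a fixed-stepsize SGD update in place of the raw gradient.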
arXiv Detail & Related papers (2022-11-16T01:05:41Z) - PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially in the relevant parameters of the system.
arXiv Detail & Related papers (2022-07-12T17:57:17Z) - Bayesian Sparse learning with preconditioned stochastic gradient MCMC
and its applications [5.660384137948734]
We show that the proposed algorithm can asymptotically converge to the correct distribution with a controllable bias under mild conditions.
arXiv Detail & Related papers (2020-06-29T20:57:20Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of such extrapolation schemes can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
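As a hedged sketch of the basic building block (the paper's unified framework covers several such schemes, not just this one), an extragradient-style extrapolation step looks like the following; stochastic_grad and lr are illustrative placeholders.

```python
def extrapolated_sgd_step(x, stochastic_grad, lr=0.1):
    """Extragradient-style extrapolation: take a look-ahead gradient step,
    then update the iterate using the gradient evaluated at the look-ahead
    point. `stochastic_grad(x)` returns a minibatch gradient at x."""
    x_lookahead = x - lr * stochastic_grad(x)
    return x - lr * stochastic_grad(x_lookahead)
```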
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Adaptive Learning of the Optimal Batch Size of SGD [52.50880550357175]
We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling scheme suitable for distributed implementations.
arXiv Detail & Related papers (2020-05-03T14:28:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.