Adaptive Learning Rates for Faster Stochastic Gradient Methods
- URL: http://arxiv.org/abs/2208.05287v1
- Date: Wed, 10 Aug 2022 11:36:00 GMT
- Title: Adaptive Learning Rates for Faster Stochastic Gradient Methods
- Authors: Samuel Horváth, Konstantin Mishchenko, Peter Richtárik
- Abstract summary: We propose new adaptive step size strategies that improve several stochastic gradient methods.
Our first method (StoPS) is based on the classical Polyak step size (Polyak, 1987) and extends its recent stochastic development, SPS (Loizou et al., 2021).
Our second method, denoted GraDS, rescales the step size by the "diversity of stochastic gradients".
- Score: 6.935471115003109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose new adaptive step size strategies that improve
several stochastic gradient methods. Our first method (StoPS) is based on the
classical Polyak step size (Polyak, 1987) and is an extension of the recent
development of this method for stochastic optimization, SPS (Loizou et al.,
2021), and our second method, denoted GraDS, rescales step size by "diversity
of stochastic gradients". We provide a theoretical analysis of these methods
for strongly convex smooth functions and show they enjoy deterministic-like
rates despite stochastic gradients. Furthermore, we demonstrate the theoretical
superiority of our adaptive methods on quadratic objectives. Unfortunately,
both StoPS and GraDS depend on unknown quantities, which makes them practical only
for overparametrized models. To remedy this, we drop this undesired dependence
and redefine StoPS and GraDS to StoP and GraD, respectively. We show that these
new methods converge linearly to the neighbourhood of the optimal solution
under the same assumptions. Finally, we corroborate our theoretical claims by
experimental validation, which reveals that GraD is particularly useful for
deep learning optimization.
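To make the two ingredients above concrete, here is a minimal NumPy sketch. The `sps_step` rule follows the Polyak/SPS formula of Loizou et al. (2021) cited in the abstract; the `gradient_diversity` ratio is one plausible reading of "diversity of stochastic gradients", and names such as `c`, `gamma_max`, `base_lr` and `grad_style_update` are illustrative assumptions rather than the paper's exact StoPS/GraDS definitions.

```python
import numpy as np

def sps_step(loss_i, grad_i, f_i_star=0.0, c=0.5, gamma_max=1.0):
    # Polyak-style stochastic step in the spirit of SPS (Loizou et al., 2021):
    # gamma = min( (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2), gamma_max ).
    # f_i_star defaults to 0, which is exact for interpolating (overparametrized) models.
    g_norm_sq = float(np.dot(grad_i, grad_i)) + 1e-12   # guard against a zero gradient
    return min((loss_i - f_i_star) / (c * g_norm_sq), gamma_max)

def gradient_diversity(per_sample_grads):
    # One plausible "diversity of stochastic gradients" ratio (an assumption, not
    # necessarily the paper's definition): mean squared per-sample gradient norm
    # divided by the squared norm of the mean gradient. It equals 1 when all
    # gradients align and grows as they point in different directions.
    grads = np.asarray(per_sample_grads, dtype=float)    # shape (batch, dim)
    mean_grad = grads.mean(axis=0)
    mean_sq_norm = (grads ** 2).sum(axis=1).mean()
    return mean_sq_norm / (float(np.dot(mean_grad, mean_grad)) + 1e-12)

def grad_style_update(x, per_sample_grads, base_lr=0.1):
    # GraD-flavoured update: rescale a fixed base step size by the diversity ratio.
    grads = np.asarray(per_sample_grads, dtype=float)
    return x - base_lr * gradient_diversity(grads) * grads.mean(axis=0)
```

On a toy least-squares problem, `sps_step` would be fed the per-sample loss 0.5*(a_i·x - b_i)^2 and its gradient; the default f_i^* = 0 matches the interpolation (overparametrized) setting discussed in the abstract.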
Related papers
- A Unified Theory of Stochastic Proximal Point Methods without Smoothness [52.30944052987393]
Proximal point methods have attracted considerable interest owing to their numerical stability and robustness against imperfect tuning.
This paper presents a comprehensive analysis of a broad range of variants of the stochastic proximal point method (SPPM).
arXiv Detail & Related papers (2024-05-24T21:09:19Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Model-Based Reparameterization Policy Gradient Methods: Theory and
Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
arXiv Detail & Related papers (2023-10-30T18:43:21Z) - Adapting Step-size: A Unified Perspective to Analyze and Improve
Gradient-based Methods for Adversarial Attacks [21.16546620434816]
We provide a unified theoretical interpretation of gradient-based adversarial learning methods.
We show that each of these algorithms is in fact a specific reformulation of its original gradient method.
We present a broad class of adaptive gradient-based algorithms based on the regular gradient methods.
arXiv Detail & Related papers (2023-01-27T06:17:51Z) - A Provably Efficient Model-Free Posterior Sampling Method for Episodic
Reinforcement Learning [50.910152564914405]
Existing posterior sampling methods for reinforcement learning are limited by being model-based or lack worst-case theoretical guarantees beyond linear MDPs.
This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees.
arXiv Detail & Related papers (2022-08-23T12:21:01Z) - Formal guarantees for heuristic optimization algorithms used in machine
learning [6.978625807687497]
Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for solving large-scale optimization problems in machine learning (ML).
We provide formal guarantees for a few convex optimization methods and propose improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z) - Cutting Some Slack for SGD with Adaptive Polyak Stepsizes [35.024680868164445]
We consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods.
We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems.
We use this insight to develop new variants of the SPS method that are better suited to nonlinear models (see the Passive-Aggressive sketch after this list).
arXiv Detail & Related papers (2022-02-24T19:31:03Z) - Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive
Step Size [29.15132344744801]
We establish local convergence of stochastic gradient descent with adaptive step size for problems such as matrix inversion.
We show that these first order optimization methods can achieve sub-linear or linear convergence.
arXiv Detail & Related papers (2021-12-30T00:50:30Z) - A Discrete Variational Derivation of Accelerated Methods in Optimization [68.8204255655161]
We introduce variational integrators which allow us to derive different methods for optimization.
We derive two families of optimization methods in one-to-one correspondence.
The preservation of symplecticity of autonomous systems occurs here solely on the fibers.
arXiv Detail & Related papers (2021-06-04T20:21:53Z) - Proximal Gradient Temporal Difference Learning: Stable Reinforcement
Learning with Polynomial Sample Complexity [40.73281056650241]
We introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true gradient temporal difference learning algorithms.
We show how gradient TD reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function.
arXiv Detail & Related papers (2020-06-06T21:04:21Z) - Adaptive Learning of the Optimal Batch Size of SGD [52.50880550357175]
We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling strategy suitable for distributed implementations.
arXiv Detail & Related papers (2020-05-03T14:28:32Z)