Adaptive Learning Rates for Faster Stochastic Gradient Methods
- URL: http://arxiv.org/abs/2208.05287v1
- Date: Wed, 10 Aug 2022 11:36:00 GMT
- Title: Adaptive Learning Rates for Faster Stochastic Gradient Methods
- Authors: Samuel Horváth, Konstantin Mishchenko, Peter Richtárik
- Abstract summary: We propose new adaptive step size strategies that improve several stochastic gradient methods.
Our first method (StoPS) is based on the classical Polyak step size (Polyak, 1987) and extends its recent stochastic development, SPS (Loizou et al., 2021).
Our second method, denoted GraDS, rescales the step size by the "diversity of stochastic gradients".
- Score: 6.935471115003109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose new adaptive step size strategies that improve
several stochastic gradient methods. Our first method (StoPS) is based on the
classical Polyak step size (Polyak, 1987) and is an extension of the recent
development of this method for stochastic optimization, SPS (Loizou et al.,
2021), and our second method, denoted GraDS, rescales step size by "diversity
of stochastic gradients". We provide a theoretical analysis of these methods
for strongly convex smooth functions and show they enjoy deterministic-like
rates despite stochastic gradients. Furthermore, we demonstrate the theoretical
superiority of our adaptive methods on quadratic objectives. Unfortunately,
both StoPS and GraDS depend on unknown quantities, which makes them practical only
for overparametrized models. To remedy this, we drop this undesired dependence
and redefine StoPS and GraDS to StoP and GraD, respectively. We show that these
new methods converge linearly to the neighbourhood of the optimal solution
under the same assumptions. Finally, we corroborate our theoretical claims by
experimental validation, which reveals that GraD is particularly useful for
deep learning optimization.
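To make the two ingredients above concrete, here is a minimal NumPy sketch. The `sps_step` rule follows the Polyak/SPS formula of Loizou et al. (2021) cited in the abstract; the `gradient_diversity` ratio is one plausible reading of "diversity of stochastic gradients", and names such as `c`, `gamma_max`, `base_lr` and `grad_style_update` are illustrative assumptions rather than the paper's exact StoPS/GraDS definitions.

```python
import numpy as np

def sps_step(loss_i, grad_i, f_i_star=0.0, c=0.5, gamma_max=1.0):
    # Polyak-style stochastic step in the spirit of SPS (Loizou et al., 2021):
    # gamma = min( (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2), gamma_max ).
    # f_i_star defaults to 0, which is exact for interpolating (overparametrized) models.
    g_norm_sq = float(np.dot(grad_i, grad_i)) + 1e-12   # guard against a zero gradient
    return min((loss_i - f_i_star) / (c * g_norm_sq), gamma_max)

def gradient_diversity(per_sample_grads):
    # One plausible "diversity of stochastic gradients" ratio (an assumption, not
    # necessarily the paper's definition): mean squared per-sample gradient norm
    # divided by the squared norm of the mean gradient. It equals 1 when all
    # gradients align and grows as they point in different directions.
    grads = np.asarray(per_sample_grads, dtype=float)    # shape (batch, dim)
    mean_grad = grads.mean(axis=0)
    mean_sq_norm = (grads ** 2).sum(axis=1).mean()
    return mean_sq_norm / (float(np.dot(mean_grad, mean_grad)) + 1e-12)

def grad_style_update(x, per_sample_grads, base_lr=0.1):
    # GraD-flavoured update: rescale a fixed base step size by the diversity ratio.
    grads = np.asarray(per_sample_grads, dtype=float)
    return x - base_lr * gradient_diversity(grads) * grads.mean(axis=0)
```

On a toy least-squares problem, `sps_step` would be fed the per-sample loss 0.5*(a_i·x - b_i)^2 and its gradient; the default f_i^* = 0 matches the interpolation (overparametrized) setting discussed in the abstract.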
Related papers
- A Unified Theory of Stochastic Proximal Point Methods without Smoothness [52.30944052987393]
Proximal point methods have attracted considerable interest owing to their numerical stability and robustness against imperfect tuning.
This paper presents a comprehensive analysis of a broad range of variants of the stochastic proximal point method (SPPM).
arXiv Detail & Related papers (2024-05-24T21:09:19Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Model-Based Reparameterization Policy Gradient Methods: Theory and
Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
arXiv Detail & Related papers (2023-10-30T18:43:21Z) - Adapting Step-size: A Unified Perspective to Analyze and Improve
Gradient-based Methods for Adversarial Attacks [21.16546620434816]
We provide a unified theoretical interpretation of gradient-based adversarial learning methods.
We show that each of these algorithms is in fact a specific reformulation of its original gradient method.
We present a broad class of adaptive gradient-based algorithms based on the regular gradient methods.
arXiv Detail & Related papers (2023-01-27T06:17:51Z) - A Provably Efficient Model-Free Posterior Sampling Method for Episodic
Reinforcement Learning [50.910152564914405]
Existing posterior sampling methods for reinforcement learning are limited by being model-based or lack worst-case theoretical guarantees beyond linear MDPs.
This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees.
arXiv Detail & Related papers (2022-08-23T12:21:01Z) - Formal guarantees for heuristic optimization algorithms used in machine
learning [6.978625807687497]
Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for solving large-scale optimization problems in machine learning (ML).
We provide formal guarantees for a few convex optimization methods and propose improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z) - Cutting Some Slack for SGD with Adaptive Polyak Stepsizes [35.024680868164445]
We consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods.
We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems.
We use this insight to develop new variants of the SPS method that are better suited to nonlinear models (see the Passive-Aggressive sketch after this list).
arXiv Detail & Related papers (2022-02-24T19:31:03Z) - Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive
Step Size [29.15132344744801]
We establish local convergence of stochastic gradient descent with adaptive step size for problems such as matrix inversion.
We show that these first order optimization methods can achieve sub-linear or linear convergence.
arXiv Detail & Related papers (2021-12-30T00:50:30Z) - A Discrete Variational Derivation of Accelerated Methods in Optimization [68.8204255655161]
We introduce variational integrators which allow us to derive different methods for optimization.
We derive two families of optimization methods in one-to-one correspondence.
The preservation of symplecticity of autonomous systems occurs here solely on the fibers.
arXiv Detail & Related papers (2021-06-04T20:21:53Z) - Proximal Gradient Temporal Difference Learning: Stable Reinforcement
Learning with Polynomial Sample Complexity [40.73281056650241]
We introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true gradient temporal difference learning algorithms.
We show how gradient TD reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function.
arXiv Detail & Related papers (2020-06-06T21:04:21Z) - Adaptive Learning of the Optimal Batch Size of SGD [52.50880550357175]
We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling strategy suitable for distributed implementations.
arXiv Detail & Related papers (2020-05-03T14:28:32Z)