Learning-Rate-Free Learning by D-Adaptation
- URL: http://arxiv.org/abs/2301.07733v5
- Date: Fri, 7 Jul 2023 19:08:18 GMT
- Title: Learning-Rate-Free Learning by D-Adaptation
- Authors: Aaron Defazio and Konstantin Mishchenko
- Abstract summary: D-Adaptation is an approach to automatically setting the learning rate which achieves the optimal rate of convergence for convex Lipschitz functions.
We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems.
- Score: 18.853820404058983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: D-Adaptation is an approach to automatically setting the learning rate which
asymptotically achieves the optimal rate of convergence for minimizing convex
Lipschitz functions, with no back-tracking or line searches, and no additional
function value or gradient evaluations per step. Our approach is the first
hyper-parameter free method for this class without additional multiplicative
log factors in the convergence rate. We present extensive experiments for SGD
and Adam variants of our method, where the method automatically matches
hand-tuned learning rates across more than a dozen diverse machine learning
problems, including large-scale vision and language problems.
An open-source implementation is available.
Related papers
- Optimization by Parallel Quasi-Quantum Annealing with Gradient-Based Sampling [0.0]
This study proposes a different approach that integrates gradient-based update through continuous relaxation, combined with Quasi-Quantum Annealing (QQA)
Numerical experiments demonstrate that our method is a competitive general-purpose solver, achieving performance comparable to iSCO and learning-based solvers.
arXiv Detail & Related papers (2024-09-02T12:55:27Z) - Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses [5.052293146674794]
It is known that the standard descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam fail to converge if the learning rates do not converge to zero.
In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates.
arXiv Detail & Related papers (2024-06-20T14:07:39Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $mathcalO( ln(T) / T 1 - frac1alpha ).
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $tildeO(dsqrtHK)$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z) - BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach [46.457298683984924]
Bilevel optimization (BO) is useful for solving a variety important machine learning problems.
Conventional methods need to differentiate through the low-level optimization process with implicit differentiation.
First-order BO depends only on first-order information, requires no implicit differentiation.
arXiv Detail & Related papers (2022-09-19T01:51:12Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient
Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Simple Stochastic and Online Gradient DescentAlgorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair instances.
Online descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple and online descent to methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm that is capable of improving the accuracy of such weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) with learning the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variances of ZO gradient by learning a sampling policy, and converge faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by
Local Quadratic Approximation [7.386152866234369]
In deep learning tasks, the learning rate determines the update step size in each iteration.
We propose a novel optimization method based on local quadratic approximation (LQA)
arXiv Detail & Related papers (2020-04-07T10:55:12Z) - Statistical Adaptive Stochastic Gradient Methods [34.859895010071234]
We propose a statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in gradient methods.
SALSA first uses a smoothed line-search procedure to gradually increase the learning rate, then automatically decreases the learning rate.
The method for decreasing the learning rate is based on a new statistical test for detecting station switches when using a constant step size.
arXiv Detail & Related papers (2020-02-25T00:04:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.