When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement
- URL: http://arxiv.org/abs/2310.07831v1
- Date: Wed, 11 Oct 2023 19:16:35 GMT
- Title: When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement
- Authors: Aaron Defazio, Ashok Cutkosky, Harsh Mehta, Konstantin Mishchenko
- Abstract summary: Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
- Score: 51.12097770185634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning rate schedules used in practice bear little resemblance to those
recommended by theory. We close much of this theory/practice gap, and as a
consequence are able to derive new problem-adaptive learning rate schedules.
Our key technical contribution is a refined analysis of learning rate schedules
for a wide class of optimization algorithms (including SGD). In contrast to
most prior works that study the convergence of the average iterate, we study
the last iterate, which is what most people use in practice. When considering
only worst-case analysis, our theory predicts that the best choice is the
linear decay schedule: a popular choice in practice that sets the stepsize
proportionally to $1 - t/T$, where $t$ is the current iteration and $T$ is the
total number of steps. To go beyond this worst-case analysis, we use the
observed gradient norms to derive schedules refined for any particular task.
These refined schedules exhibit learning rate warm-up and rapid learning rate
annealing near the end of training. Ours is the first systematic approach to
automatically yield both of these properties. We perform the most comprehensive
evaluation of learning rate schedules to date, evaluating across 10 diverse
deep learning problems, a series of LLMs, and a suite of logistic regression
problems. We validate that overall, the linear-decay schedule matches or
outperforms all commonly used default schedules including cosine annealing, and
that our schedule refinement method gives further improvements.
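As a rough illustration of the linear decay schedule described in the abstract (stepsize proportional to $1 - t/T$), here is a minimal Python sketch. The function names, the `base_lr` parameter, and the zero-based indexing of $t$ are assumptions made for this example and are not taken from the paper. The cosine annealing formula is the commonly used half-cosine form, included only for comparison. The refined, gradient-norm-based schedules are not reproduced here.

```python
import math


def linear_decay(t: int, T: int, base_lr: float = 1.0) -> float:
    """Stepsize proportional to 1 - t/T for t = 0, ..., T - 1.

    `base_lr` (hypothetical name) is the peak learning rate at t = 0.
    """
    if not 0 <= t < T:
        raise ValueError("expected 0 <= t < T")
    return base_lr * (1.0 - t / T)


def cosine_annealing(t: int, T: int, base_lr: float = 1.0) -> float:
    """Common half-cosine schedule, shown for comparison only."""
    if not 0 <= t < T:
        raise ValueError("expected 0 <= t < T")
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t / T))


if __name__ == "__main__":
    T = 10  # total number of steps (illustrative value)
    for t in range(T):
        print(t, round(linear_decay(t, T), 3), round(cosine_annealing(t, T), 3))
```

Both schedules start at `base_lr` and approach zero by step $T$. They differ in shape: linear decay falls at a constant rate, while the cosine form falls slowly at first and fastest around the midpoint.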
Related papers
- The Road Less Scheduled [75.09232139131437]
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T.
We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely.
arXiv Detail & Related papers (2024-05-24T16:20:46Z)
- Mechanic: A Learning Rate Tuner [52.4242550204696]
We introduce a technique for automatically tuning the learning rate scale factor of any base optimization algorithm and schedule, which we call Mechanic.
We rigorously evaluate Mechanic on a range of large-scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms.
arXiv Detail & Related papers (2023-05-31T19:32:43Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima [40.70374106466073]
We propose a generic learning rate schedule plugin called LEArning Rate Perturbation (LEAP).
LEAP can be applied to various learning rate schedules to improve the model training by introducing a certain perturbation to the learning rate.
We conduct extensive experiments which show that training with LEAP can improve the performance of various deep learning models on diverse datasets.
arXiv Detail & Related papers (2022-08-25T05:05:18Z)
- Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums [26.44093918424658]
Eigencurve is the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives.
Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks.
Two simple learning rate schedulers for practical applications can approximate Eigencurve.
arXiv Detail & Related papers (2021-10-27T01:17:53Z)
- REX: Revisiting Budgeted Training with an Improved Schedule [14.618325490983052]
We propose a novel profile and sampling rate combination called the Reflected Exponential (REX) schedule.
REX outperforms the linear schedule in the low budget regime, while matching or exceeding the performance of several state-of-the-art learning rate schedules.
arXiv Detail & Related papers (2021-07-09T04:17:35Z)
- Training Aware Sigmoidal Optimizer [2.99368851209995]
The loss functions of deep neural networks present landscapes with many more saddle points than local minima.
We propose the Training Aware Sigmoidal Optimizer (TASO), which consists of a two-phase automated learning rate schedule.
We compare the proposed approach with commonly used adaptive learning rate methods such as Adam, RMSProp, and Adagrad.
arXiv Detail & Related papers (2021-02-17T12:00:46Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation [8.340191147575307]
We introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-regressive formulation.
It flexibly adjusts to abrupt changes of behaviours induced by new learning rate values.
It is well-suited to tackle a set of problems: first, for the on-line adaptation of the learning rate for a cold-started run; then, for tuning the schedule for a set of similar tasks, as well as warm-starting it for a new task.
arXiv Detail & Related papers (2020-06-25T13:18:18Z)
- The Two Regimes of Deep Network Training [93.84309968956941]
We study the effects of different learning schedules and the appropriate way to select them.
To this end, we isolate two distinct phases, which we refer to as the "large-step regime" and the "small-step regime".
Our training algorithm can significantly simplify learning rate schedules.
arXiv Detail & Related papers (2020-02-24T17:08:24Z)