Optimal Linear Decay Learning Rate Schedules and Further Refinements
- URL: http://arxiv.org/abs/2310.07831v2
- Date: Tue, 29 Oct 2024 22:57:44 GMT
- Title: Optimal Linear Decay Learning Rate Schedules and Further Refinements
- Authors: Aaron Defazio, Ashok Cutkosky, Harsh Mehta, Konstantin Mishchenko
- Abstract summary: Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
- Score: 46.79573408189601
- License:
- Abstract: Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our main technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). When considering only worst-case analysis, our theory predicts that the optimal choice is the linear decay schedule where the step-size is set proportional to 1 - t/T, where t is the current iteration and T is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule outperforms all commonly used default schedules including cosine annealing. Our adaptive schedule refinement method gives further improvements.
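For concreteness, the linear decay rule described above (step size proportional to 1 - t/T) can be written in a few lines of Python. The snippet below is an illustrative sketch only, not the authors' released implementation; the `warmup_steps` knob in the second function is a hypothetical addition that mimics the warm-up shape of the refined schedules, which the paper actually derives from observed gradient norms.

```python
# Illustrative sketch only, not the authors' released implementation.
# Shows (1) the worst-case-optimal linear decay schedule from the abstract,
# lr_t = base_lr * (1 - t / T), and (2) a common warmup + linear-decay
# variant; `warmup_steps` is a hypothetical knob added here for illustration.

def linear_decay_lr(base_lr: float, t: int, T: int) -> float:
    """Linear decay: step size proportional to 1 - t/T."""
    return base_lr * (1.0 - t / T)


def warmup_linear_decay_lr(base_lr: float, t: int, T: int,
                           warmup_steps: int) -> float:
    """Linear warmup over `warmup_steps`, then linear decay to zero at step T."""
    if t < warmup_steps:
        return base_lr * (t + 1) / warmup_steps
    return base_lr * (1.0 - (t - warmup_steps) / max(1, T - warmup_steps))


if __name__ == "__main__":
    T = 100
    for t in (0, 25, 50, 99):
        print(t,
              round(linear_decay_lr(0.1, t, T), 4),
              round(warmup_linear_decay_lr(0.1, t, T, warmup_steps=10), 4))
```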
Related papers
- Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives.
We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
arXiv Detail & Related papers (2024-10-29T14:41:44Z)
- The Road Less Scheduled [45.01813613035411]
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly outperformed by learning rate schedules that depend on T.
We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely.
Our Schedule-Free approach introduces no additional hyperparameters over standard schedules with momentum.
arXiv Detail & Related papers (2024-05-24T16:20:46Z)
- Mechanic: A Learning Rate Tuner [52.4242550204696]
We introduce a technique for tuning the learning rate scale factor of any base optimization algorithm and schedule automatically, which we call Mechanic.
We rigorously evaluate Mechanic on a range of large-scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms.
arXiv Detail & Related papers (2023-05-31T19:32:43Z)
- Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima [40.70374106466073]
We propose a generic learning rate schedule plugin called LEArning Rate Perturbation (LEAP)
LEAP can be applied to various learning rate schedules to improve model training by introducing a certain perturbation to the learning rate.
We conduct extensive experiments which show that training with LEAP can improve the performance of various deep learning models on diverse datasets.
arXiv Detail & Related papers (2022-08-25T05:05:18Z)
- Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums [26.44093918424658]
Eigencurve is the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives.
Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks.
Two simple learning rate schedulers for practical applications can approximate Eigencurve.
arXiv Detail & Related papers (2021-10-27T01:17:53Z)
- REX: Revisiting Budgeted Training with an Improved Schedule [14.618325490983052]
We propose a novel profile and sampling rate combination called the Reflected Exponential (REX) schedule.
REX outperforms the linear schedule in the low budget regime, while matching or exceeding the performance of several state-of-the-art learning rate schedules.
arXiv Detail & Related papers (2021-07-09T04:17:35Z)
- Training Aware Sigmoidal Optimizer [2.99368851209995]
Deep network loss landscapes present many more saddle points than local minima.
We propose the Training Aware Sigmoidal Optimizer (TASO), a two-phase automated learning rate schedule.
We compare the proposed approach with commonly used adaptive learning rate methods such as Adam, RMSProp, and Adagrad.
arXiv Detail & Related papers (2021-02-17T12:00:46Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large-eigenvalue directions of the data matrix, while GD converges along the small-eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- The Two Regimes of Deep Network Training [93.84309968956941]
We study the effects of different learning rate schedules and the appropriate way to select them.
To this end, we isolate two distinct phases of training, which we refer to as the "large-step regime" and the "small-step regime".
Our training algorithm can significantly simplify learning rate schedules.
arXiv Detail & Related papers (2020-02-24T17:08:24Z)