Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
- URL: http://arxiv.org/abs/2602.04774v1
- Date: Wed, 04 Feb 2026 17:11:36 GMT
- Title: Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
- Authors: Blake Bordelon, Francesco Mori,
- Abstract summary: We explore a solvable model of optimal learning rate schedules for a power-law random feature model trained with stochastic gradient descent (SGD). In the hard phase, the optimal schedule resembles warmup-stable-decay with a constant (in $T$) initial learning rate and annealing performed over a vanishing fraction of training steps. Our model also predicts the compute-optimal scaling laws (where model size and training steps are chosen optimally) in both easy and hard regimes.
- Score: 19.00191673972499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Setting the learning rate for a deep learning model is a critical part of successful training, yet choosing this hyperparameter is often done empirically with trial and error. In this work, we explore a solvable model of optimal learning rate schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $\eta_T^\star(t)$ where $t$ is the current iterate and $T$ is the total training horizon. This schedule is computed both numerically and analytically (when possible) using optimal control methods. Our analysis reveals two regimes which we term the easy phase and hard phase. In the easy phase the optimal schedule is a polynomial decay $\eta_T^\star(t) \simeq T^{-\xi} (1-t/T)^\delta$ where $\xi$ and $\delta$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with a constant (in $T$) initial learning rate and annealing performed over a vanishing (in $T$) fraction of training steps. We investigate joint optimization of learning rate and batch size, identifying a degenerate optimality condition. Our model also predicts the compute-optimal scaling laws (where model size and training steps are chosen optimally) in both easy and hard regimes. Going beyond SGD, we consider optimal schedules for the momentum $\beta(t)$, where speedups in the hard phase are possible. We compare our optimal schedule to various benchmarks in our task, including (1) optimal constant learning rates $\eta_T(t) \sim T^{-\xi}$ and (2) optimal power laws $\eta_T(t) \sim T^{-\xi} t^{-\chi}$, finding that our schedule achieves better rates than either of these. Our theory suggests that learning rate transfer across training horizons depends on the structure of the model and task. We explore these ideas in simple experimental pretraining setups.
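To make the schedule families in the abstract concrete, here is a minimal NumPy sketch of the shapes it compares: the easy-phase polynomial decay $\eta_T^\star(t) \simeq T^{-\xi}(1-t/T)^\delta$, a hard-phase stable-then-anneal schedule (warmup omitted for brevity), and the two benchmarks, an optimal constant rate and a power-law decay. The exponent values, the 5% annealing fraction, and the linear anneal are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

# Illustrative exponents; in the paper they depend on the power-law structure
# of the features and the task. These particular values are placeholders.
XI, DELTA, CHI = 0.5, 1.0, 0.3


def easy_phase_schedule(t, T, xi=XI, delta=DELTA):
    """Easy-phase optimal form from the abstract: eta(t) ~ T^{-xi} (1 - t/T)^delta."""
    return T ** (-xi) * (1.0 - t / T) ** delta


def wsd_like_schedule(t, T, eta0=0.1, anneal_frac=0.05):
    """Hard-phase shape described in the abstract: a constant (in T) learning rate,
    with annealing confined to a small trailing fraction of the T steps.
    The linear anneal and the 5% fraction are illustrative choices, not the paper's."""
    t_anneal = (1.0 - anneal_frac) * T
    if t < t_anneal:
        return eta0
    return eta0 * (T - t) / (T - t_anneal)


def constant_benchmark(t, T, xi=XI):
    """Benchmark (1): the best constant learning rate, eta ~ T^{-xi}."""
    return T ** (-xi)


def power_law_benchmark(t, T, xi=XI, chi=CHI):
    """Benchmark (2): a power-law decay eta(t) ~ T^{-xi} t^{-chi}."""
    return T ** (-xi) * max(t, 1.0) ** (-chi)


if __name__ == "__main__":
    T = 10_000
    ts = np.arange(T)
    for name, fn in [("easy-phase", easy_phase_schedule),
                     ("wsd-like", wsd_like_schedule),
                     ("constant", constant_benchmark),
                     ("power-law", power_law_benchmark)]:
        etas = np.array([fn(t, T) for t in ts])
        print(f"{name:10s}  eta(0)={etas[0]:.4f}  "
              f"eta(T/2)={etas[T // 2]:.4f}  eta(T-1)={etas[-1]:.6f}")
```

Printing the three snapshots per schedule makes the qualitative difference visible: the easy-phase and benchmark schedules start at a horizon-dependent level $T^{-\xi}$, whereas the WSD-like schedule holds a horizon-independent rate until the final annealing window.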
Related papers
- Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay [9.371921537573346]
We study optimal learning-rate schedules (LRSs) under the functional scaling law. The functional scaling law accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. We analyze optimal shape-fixed schedules, where only the peak learning rate is tuned.
arXiv Detail & Related papers (2026-02-06T15:52:30Z) - Scaling and Transferability of Annealing Strategies in Large Language Model Training [59.443651879173025]
We refine a predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. We validate our findings in extensive experiments using both Dense and Mixture-of-Experts (MoE) models.
arXiv Detail & Related papers (2025-12-05T16:38:33Z) - Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages. We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across languages (a minimal width-scaling sketch follows this list).
arXiv Detail & Related papers (2025-12-05T11:03:41Z) - Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
arXiv Detail & Related papers (2023-10-11T19:16:35Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Optimal learning rate schedules in high-dimensional non-convex
optimization problems [14.058580956992051]
Learning rate schedules are ubiquitously used to speed up and improve optimisation.
We present a first analytical study of the role of learning rate scheduling in this setting.
arXiv Detail & Related papers (2022-02-09T15:15:39Z) - Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic
Objectives with Skewed Hessian Spectrums [26.44093918424658]
Eigencurve is the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives.
Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks.
Two simple learning rate schedulers for practical applications can approximate Eigencurve.
arXiv Detail & Related papers (2021-10-27T01:17:53Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - REX: Revisiting Budgeted Training with an Improved Schedule [14.618325490983052]
We propose a novel profile and sampling rate combination called the Reflected Exponential (REX) schedule.
REX outperforms the linear schedule in the low budget regime, while matching or exceeding the performance of several state-of-the-art learning rate schedules.
arXiv Detail & Related papers (2021-07-09T04:17:35Z)
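As referenced in the hyperparameter-transfer entry above, here is a minimal Python sketch of the width-scaling rules it mentions: a base learning rate tuned at a small proxy width is transferred to larger widths, and an "independent" weight decay is scaled as $1/\mathrm{width}$. The base values, the proxy width, and the Adam-style $1/\mathrm{width}$ learning-rate rule are illustrative assumptions, not that paper's exact prescriptions.

```python
# Hypothetical helper: width-scaled hyperparameters in the spirit of muP transfer.
# The constants and the Adam-style 1/width learning-rate rule are assumptions
# for illustration; the referenced paper's exact prescriptions may differ.

BASE_WIDTH = 256      # proxy width at which hyperparameters were tuned
BASE_LR = 3e-3        # tuned peak learning rate at BASE_WIDTH
BASE_WD = 0.1         # tuned "independent" weight decay at BASE_WIDTH


def scaled_hparams(width: int) -> dict:
    """Return learning rate and weight decay for a model of the given width."""
    scale = BASE_WIDTH / width
    return {
        # muP-style transfer: hidden-layer learning rate shrinks like 1/width
        # (the rule commonly used with Adam-style updates).
        "lr": BASE_LR * scale,
        # The related paper's finding: independent weight decay ~ 1/width is near-optimal.
        "weight_decay": BASE_WD * scale,
    }


if __name__ == "__main__":
    for w in (256, 1024, 4096):
        print(w, scaled_hparams(w))
```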