Schedulers for Schedule-free: Theoretically inspired hyperparameters
- URL: http://arxiv.org/abs/2511.07767v1
- Date: Wed, 12 Nov 2025 01:16:28 GMT
- Title: Schedulers for Schedule-free: Theoretically inspired hyperparameters
- Authors: Yuen-Man Pun, Matthew Buchholz, Robert M. Gower
- Abstract summary: We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free.
- Score: 9.569316316728903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice uses a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing that our convergence theory has some predictive power with regard to practical training of deep neural networks, even though the theory relies on a convexity assumption. When applied to the warmup-stable-decay (wsd) schedule, our theory yields the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence rate for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
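For context, schedule-free methods maintain three coupled sequences: a base SGD iterate, a running average (the returned iterate), and an interpolated point at which gradients are evaluated. Below is a minimal Python sketch of schedule-free SGD driven by an arbitrary scheduler, using a wsd-shaped schedule as the example. The function names, the wsd constants, and in particular the coupling $c_t \propto \gamma_t^2$ between the averaging weight and the learning rate are illustrative assumptions, not the exact rule derived in the paper.

```python
import numpy as np

def wsd_schedule(t, T, warmup_frac=0.1, decay_frac=0.2, gamma_max=1.0):
    """Warmup-stable-decay (wsd) learning rate: linear warmup, constant
    plateau, then linear decay to zero at step T (illustrative shape only)."""
    warmup_end = max(1, int(warmup_frac * T))
    decay_start = max(warmup_end, int((1.0 - decay_frac) * T))
    if t < warmup_end:
        return gamma_max * (t + 1) / warmup_end
    if t < decay_start:
        return gamma_max
    return gamma_max * (T - t) / (T - decay_start)

def schedule_free_sgd(grad, x0, T, beta=0.9, schedule=wsd_schedule):
    """Schedule-free SGD iterates (z, x, y) with the per-step learning rate
    gamma_t taken from an arbitrary scheduler. The averaging weight c_t is
    tied to gamma_t^2 here -- an assumed coupling for illustration; the paper
    derives how c_t must actually depend on the schedule."""
    z = x = x0.copy()
    sum_sq = 0.0
    for t in range(T):
        gamma = schedule(t, T)
        y = (1.0 - beta) * z + beta * x   # gradients are evaluated at y
        z = z - gamma * grad(y)           # base SGD step on z
        sum_sq += gamma ** 2
        c = gamma ** 2 / sum_sq           # schedule-dependent averaging weight (assumption)
        x = (1.0 - c) * x + c * z         # running weighted average; x is the returned iterate
    return x

# Toy usage: minimize the quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = schedule_free_sgd(grad=lambda x: x, x0=np.ones(5), T=1000)
```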
Related papers
- Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers [43.838677595865846]
We develop a practical learning rate scheduler that adapts the warm-up duration automatically at the beginning of training.
We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules.
arXiv Detail & Related papers (2026-02-05T16:06:19Z)
- Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model [19.00191673972499]
We explore a solvable model of optimal learning rate schedules for a power-law random feature model trained with stochastic gradient descent (SGD).
In the hard phase, the optimal schedule resembles warmup-stable-decay, with a constant (in $T$) initial learning rate and a decay phase that occupies a vanishing fraction of the training steps.
Our model also predicts the compute-optimal scaling laws (where model size and training steps are jointly chosen) in both easy and hard regimes.
arXiv Detail & Related papers (2026-02-04T17:11:36Z)
- The Role of Target Update Frequencies in Q-Learning [4.76285598583384]
The target network update frequency (TUF) is a central stabilization mechanism in (deep) Q-learning.
We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator.
We show that the optimal target update frequency increases geometrically over the course of the learning process.
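To make the mechanism concrete, here is a hedged tabular sketch of Q-learning with periodic target updates, where the update interval grows geometrically over training in the spirit of the finding above; the environment interface (reset/step/sample_action), the growth factor, and all other constants are hypothetical and not taken from the paper.

```python
import numpy as np

def q_learning_with_periodic_targets(env, n_states, n_actions, episodes=500,
                                     alpha=0.1, gamma_disc=0.99,
                                     tuf0=10, growth=1.5, eps=0.1):
    """Tabular Q-learning that bootstraps from a frozen target table refreshed
    every `tuf` steps. The geometrically growing `tuf` mirrors the finding that
    the optimal update frequency increases over training; the constants and the
    env interface (reset()/step(a)/sample_action()) are illustrative assumptions."""
    Q = np.zeros((n_states, n_actions))
    Q_target = Q.copy()
    tuf, steps_since_sync = tuf0, 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Inexact Bellman backup: bootstrap from the frozen target table.
            td_target = r + (0.0 if done else gamma_disc * Q_target[s_next].max())
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
            steps_since_sync += 1
            if steps_since_sync >= tuf:           # periodic target sync
                Q_target = Q.copy()
                steps_since_sync = 0
                tuf = int(np.ceil(tuf * growth))  # grow the interval geometrically
    return Q
```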
arXiv Detail & Related papers (2026-02-03T15:19:20Z)
- Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence [2.1665689529884697]
GreedyLR is a novel scheduler that adaptively adjusts the learning rate during training based on the current loss.
Our approach outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence.
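As a rough illustration of a loss-driven rule of this kind (not the published GreedyLR algorithm), the toy controller below grows the learning rate while a smoothed loss keeps decreasing and shrinks it otherwise; the factors, bounds, and smoothing constant are invented for the example.

```python
class LossDrivenLR:
    """Toy loss-based learning-rate controller: multiplicatively increases the
    LR while the smoothed loss keeps decreasing and decreases it otherwise.
    All constants are illustrative assumptions, not the published GreedyLR rule."""
    def __init__(self, lr=1e-3, up=1.05, down=0.7, lr_min=1e-6, lr_max=1.0, smoothing=0.9):
        self.lr, self.up, self.down = lr, up, down
        self.lr_min, self.lr_max, self.smoothing = lr_min, lr_max, smoothing
        self.ema = None  # exponential moving average of the loss

    def step(self, loss):
        prev = self.ema
        self.ema = loss if prev is None else self.smoothing * prev + (1 - self.smoothing) * loss
        if prev is not None:
            factor = self.up if self.ema < prev else self.down
            self.lr = min(self.lr_max, max(self.lr_min, self.lr * factor))
        return self.lr

# Usage inside a training loop (sketch; train_step and loader are assumed helpers):
# sched = LossDrivenLR(lr=3e-4)
# for batch in loader:
#     loss = train_step(batch, lr=sched.lr)
#     sched.step(float(loss))
```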
arXiv Detail & Related papers (2025-12-16T16:03:52Z)
- Beyond the Ideal: Analyzing the Inexact Muon Update [54.70108543057578]
We present the first analysis of the inexact update at Muon's core.
We reveal a fundamental coupling between this inexactness and the optimal step size and momentum.
arXiv Detail & Related papers (2025-10-22T18:01:07Z)
- Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction.
We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
arXiv Detail & Related papers (2025-09-30T03:56:44Z)
- Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models [57.49136894315871]
The new paradigm of test-time scaling has yielded remarkable breakthroughs in reasoning models and generative vision models.
We propose one solution to the problem of integrating test-time scaling knowledge into a model during post-training.
We replace reward-guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates the initial input noise.
arXiv Detail & Related papers (2025-08-13T17:33:37Z)
- The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training [55.233765889424035]
We show that learning-rate schedules for large model training behave surprisingly similarly to a convex bound from non-smooth optimization theory.
We achieve noticeable improvements when training 124M and 210M Llama-type models by (i) extending the schedule for continued training with an optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
arXiv Detail & Related papers (2025-01-31T08:55:56Z)
- The Road Less Scheduled [45.01813613035411]
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly outperformed by learning rate schedules that depend on T.
We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely.
Our Schedule-Free approach introduces no additional hyperparameters over standard schedules with momentum.
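For reference, the schedule-free recursion proposed here is usually written as three coupled sequences (our notation, with constant learning rate $\gamma$, interpolation parameter $\beta$, and uniform averaging weights; the paper's exact presentation may differ):
$$y_t = (1-\beta)\,z_t + \beta\,x_t, \qquad z_{t+1} = z_t - \gamma\,\nabla f(y_t, \zeta_t), \qquad x_{t+1} = (1 - c_{t+1})\,x_t + c_{t+1}\,z_{t+1}, \quad c_{t+1} = \tfrac{1}{t+1},$$
where gradients are evaluated at the interpolated point $y_t$ and the returned iterate is the running average $x_t$.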
arXiv Detail & Related papers (2024-05-24T16:20:46Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
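For concreteness, a linear decay schedule of the kind the title refers to takes the familiar form (our notation)
$$\gamma_t = \gamma_{\max}\Bigl(1 - \frac{t}{T}\Bigr), \qquad t = 0, 1, \dots, T-1,$$
which decays the learning rate linearly from $\gamma_{\max}$ at the start of training to zero at the stopping step $T$.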
arXiv Detail & Related papers (2023-10-11T19:16:35Z)
- Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums [26.44093918424658]
Eigencurve is the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives.
Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks.
Two simple learning rate schedulers for practical applications can approximate Eigencurve.
arXiv Detail & Related papers (2021-10-27T01:17:53Z)
- Support recovery and sup-norm convergence rates for sparse pivotal estimation [79.13844065776928]
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level.
We show minimax sup-norm convergence rates for non-smoothed and smoothed, single-task and multitask square-root Lasso-type estimators.
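As background, one standard member of the square-root Lasso-type family mentioned above is (our notation)
$$\hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; \frac{1}{\sqrt{n}}\,\lVert y - X\beta\rVert_2 + \lambda \lVert\beta\rVert_1,$$
whose pivotal property is that a good choice of the regularization parameter $\lambda$ does not depend on the unknown noise level.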
arXiv Detail & Related papers (2020-01-15T16:11:04Z)