Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers
- URL: http://arxiv.org/abs/2602.05813v1
- Date: Thu, 05 Feb 2026 16:06:19 GMT
- Title: Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers
- Authors: Artem Riabinin, Andrey Veprikov, Arman Bolatov, Martin Takáč, Aleksandr Beznosikov
- Abstract summary: We develop a practical learning rate scheduler that adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules.
- Score: 43.838677595865846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup
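For context, the "manually tuned warm-up schedules" that the abstract compares against usually take the shape sketched below: a linear ramp to a peak learning rate over a fixed number of steps, followed by a decay. This is a minimal illustration of that baseline, not the paper's adaptive scheduler. The function name `warmup_cosine_lr`, its parameters, and the choice of cosine decay are assumptions made for the example and are not taken from the paper or its repository.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Learning rate at `step` for linear warm-up followed by cosine decay.

    Hypothetical helper for illustration only. `warmup_steps` is the
    hand-tuned hyperparameter that an adaptive warm-up rule would choose
    automatically instead.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few points of a schedule with a fixed, hand-picked warm-up length.
    for s in (0, 4, 9, 10, 55, 100):
        print(s, round(warmup_cosine_lr(s, total_steps=100, peak_lr=1e-3, warmup_steps=10), 6))
```

In this baseline, `warmup_steps` has to be searched for each setup, which is the tuning cost the abstract says the adaptive method avoids.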
Related papers
- Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models [130.8912476550625]
We propose a positive-unlabeled (PU) reinforcement learning distillation method for on-premise small-model deployment.
Our method distills the teacher's preference-optimization capability from black-box generations into a locally trainable student.
Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.
arXiv Detail & Related papers (2026-01-28T15:14:50Z)
- Beyond Freezing: Sparse Tuning Enhances Plasticity in Continual Learning with Pre-Trained Models [10.904981532789824]
Continual Learning with Pre-trained Models holds great promise for efficient adaptation across sequential tasks.
Existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters.
We propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters.
arXiv Detail & Related papers (2025-05-26T13:09:25Z)
- Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem.
We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
- Adaptive Decoding via Latent Preference Optimization [55.70602730588745]
We introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time.
Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures.
arXiv Detail & Related papers (2024-11-14T18:31:39Z)
- Understanding Optimization in Deep Learning with Central Flows [95.5647720254338]
We develop theory that can describe the dynamics of optimization in a complex regime.
Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.
arXiv Detail & Related papers (2024-10-31T17:58:13Z)
- Adaptive Gradient Methods with Local Guarantees [48.980206926987606]
We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner.
We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains.
arXiv Detail & Related papers (2022-03-02T20:45:14Z)
- Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in noisy gradient setting.
This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
- Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation [8.340191147575307]
We introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-/regressive formulation.
It flexibly adjusts to abrupt changes of behaviours induced by new learning rate values.
It is well-suited to tackle a set of problems: first, for the on-line adaptation of the learning rate for a cold-started run; then, for tuning the schedule for a set of similar tasks, as well as warm-starting it for a new task.
arXiv Detail & Related papers (2020-06-25T13:18:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.