Related papers: The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

URL: http://arxiv.org/abs/2501.18965v1
Date: Fri, 31 Jan 2025 08:55:56 GMT
Title: The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Authors: Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach,
Abstract summary: We show that learning-rate schedules for large model training behave surprisingly similar to a convex bound from non-smooth optimization theory.<n>We achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with optimal learning-rate, and (ii) transferring the optimal learning-rate across schedules.
Score: 55.233765889424035
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with optimal learning-rate, and (ii) transferring the optimal learning-rate across schedules.

Related papers

Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence [2.1665689529884697]
emphGreedyLR is a novel scheduler that adaptively adjusts the learning rate during training based on the current loss.<n>Our approach outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence.
arXiv Detail & Related papers (2025-12-16T16:03:52Z)
Scaling and Transferability of Annealing Strategies in Large Language Model Training [59.443651879173025]
We refine a predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler.<n>Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules.<n>We validate our findings on extensive experiments using both Dense and Mixture-of-Experts (MoE) models.
arXiv Detail & Related papers (2025-12-05T16:38:33Z)
Learning Rate Scheduling with Matrix Factorization for Private Training [4.726777092009554]
We study differentially private model training with gradient descent under learning rate scheduling and correlated noise.<n>We propose a learning-rate-aware factorization that achieves improvements over prefix-sum factorizations under both MaxSE and MeanSE error metrics.
arXiv Detail & Related papers (2025-11-22T09:24:45Z)
A Trainable Optimizer [18.195022468462753]
We present a framework that jointly trains the full gradient estimator and the trainable weights of the model.<n>Pseudo-linear TO incurs negligible computational overhead, requiring only minimal additional multiplications.<n> Experiments demonstrate that TO methods converge faster than benchmark algorithms.
arXiv Detail & Related papers (2025-08-03T14:06:07Z)
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining [12.630306478872043]
We propose textbfAdaLRS, a plug-in-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search.<n>Experiments show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-16T09:14:01Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Better Schedules for Low Precision Training of Deep Neural Networks [13.88763215392452]
cyclic precision training (CPT) dynamically adjusts precision throughout training according to a cyclic schedule. CPT achieves particularly impressive improvements in training efficiency, while actually improving DNN performance.
arXiv Detail & Related papers (2024-03-04T17:33:39Z)
Navigating Scaling Laws: Compute Optimality in Adaptive Model Training [39.96209967632896]
In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. We extend the concept of optimality by allowing for an adaptive' model, i.e. a model that can change its shape during training.
arXiv Detail & Related papers (2023-11-06T16:20:28Z)
Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
arXiv Detail & Related papers (2023-10-11T19:16:35Z)
Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance. Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
Self-Supervised Primal-Dual Learning for Constrained Optimization [19.965556179096385]
This paper studies how to train machine-learning models that directly approximate the optimal solutions of constrained optimization problems. It proposes the idea of Primal-Dual Learning (PDL), a self-supervised training method that does not require a set of pre-solved instances or an optimization solver for training and inference.
arXiv Detail & Related papers (2022-08-18T20:07:10Z)
Optimization-Derived Learning with Essential Convergence Analysis of Training and Hyper-training [52.39882976848064]
We design a Generalized Krasnoselskii-Mann (GKM) scheme based on fixed-point iterations as our fundamental ODL module. Under the GKM scheme, a Bilevel Meta Optimization (BMO) algorithmic framework is constructed to solve the optimal training and hyper-training variables together.
arXiv Detail & Related papers (2022-06-16T01:50:25Z)
Fast Rates for Contextual Linear Optimization [52.39202699484225]
We show that a naive plug-in approach achieves regret convergence rates that are significantly faster than methods that directly optimize downstream decision performance. Our results are overall positive for practice: predictive models are easy and fast to train using existing tools, simple to interpret, and, as we show, lead to decisions that perform very well.
arXiv Detail & Related papers (2020-11-05T18:43:59Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
The Two Regimes of Deep Network Training [93.84309968956941]
We study the effects of different learning schedules and the appropriate way to select them. To this end, we isolate two distinct phases, which we refer to as the "large-step regime" and the "small-step regime" Our training algorithm can significantly simplify learning rate schedules.
arXiv Detail & Related papers (2020-02-24T17:08:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.