Fantastic Pretraining Optimizers and Where to Find Them
- URL: http://arxiv.org/abs/2509.02046v2
- Date: Thu, 04 Sep 2025 19:22:04 GMT
- Title: Fantastic Pretraining Optimizers and Where to Find Them
- Authors: Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang,
- Abstract summary: AdamW has long been the dominant gradients in language model pretraining.<n>Speedup of matrix-based matrices is inversely proportional to model scale.
- Score: 59.56075036649332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models. Thirdly, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all the fastest optimizers such as Muon and Soap, use matrices as preconditioners -- multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.
Related papers
- Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages.<n>We find that scaling the learning rate according to $$P improves transfer, but can still suffer from significant finite-width deviations.<n>For compute-optimal scaling, we find scaling independent weight decay as $1/mathrmwidth$ is nearly optimal across languages.
arXiv Detail & Related papers (2025-12-05T11:03:41Z) - CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models [27.682531424487564]
Sparsity-aware training is an effective approach for transforming large language models into hardware-friendly sparse patterns.<n>We propose Continuous Adaptive Sparse Trainer (CAST), a continuous and differentiable sparsity-aware training framework for sparse models.<n>Our results demonstrate significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources.
arXiv Detail & Related papers (2025-09-30T09:28:47Z) - Beyond Outliers: A Study of Optimizers Under Quantization [82.75879062804955]
We study impact of choice on model robustness under quantization.<n>We evaluate how model performance degrades when trained with different baselines.<n>We derive scaling laws for quantization-aware training under different parameters.
arXiv Detail & Related papers (2025-09-27T21:15:22Z) - Heart Disease Prediction: A Comparative Study of Optimisers Performance in Deep Neural Networks [0.0]
We compare the performance of 10 different approaches in training a simple Multi-layer Perceptron model using a heart disease dataset from Kaggle.<n>We set up a consistent training paradigm and evaluate the metrics based on metrics such as convergence speed and stability.<n>Across all our metrics, we chose RMSProp to be the most effective for this heart disease prediction task because it offered a balanced performance across key metrics.
arXiv Detail & Related papers (2025-09-10T11:15:44Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of threes.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling [27.058009599819012]
We study the connection between optimal learning rates and batch sizes for Adam styles.
We prove that the optimal learning rate first rises and then falls as the batch size increases.
arXiv Detail & Related papers (2024-05-23T13:52:36Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zerothorder (MeZO) to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.