Related papers: Towards Robust Scaling Laws for Optimizers

URL: http://arxiv.org/abs/2602.07712v1
Date: Sat, 07 Feb 2026 21:40:33 GMT
Title: Towards Robust Scaling Laws for Optimizers
Authors: Alexandra Volkova, Mher Safaryan, Christoph H. Lampert, Dan Alistarh
Abstract summary: Empirical scaling laws are widely used to predict loss as model size and training data grow. We show that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
Score: 89.21160945066737
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The quality of Large Language Model (LLM) pretraining depends on multiple factors, including the compute budget and the choice of optimization algorithm. Empirical scaling laws are widely used to predict loss as model size and training data grow; however, almost all existing studies fix the optimizer (typically AdamW). At the same time, a new generation of optimizers (e.g., Muon, Shampoo, SOAP) promises faster and more stable convergence, but their relationship with model and data scaling is not yet well understood. In this work, we study scaling laws across different optimizers. Empirically, we show that 1) separate Chinchilla-style scaling laws for each optimizer are ill-conditioned and have highly correlated parameters. Instead, 2) we propose a more robust law with shared power-law exponents and optimizer-specific rescaling factors, which enable direct comparison between optimizers. Finally, 3) we provide a theoretical analysis of gradient-based methods for the proxy task of a convex quadratic objective, demonstrating that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
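Illustration: the abstract does not state the fitted formulas, so the following is only a rough sketch of the two ideas it names. The first function is the commonly used Chinchilla-style form L(N, D) = E + A / N^alpha + B / D^beta, whose three terms can be read as irreducible, approximation, and optimization errors. The second function is a hypothetical way to combine shared exponents with optimizer-specific rescaling factors; the names scale_n and scale_d, the way they enter the formula, and all numeric values are assumptions made for illustration and may differ from the paper's parameterization.

```python
# Placeholder coefficients chosen for illustration only; they are not fitted values.
PARAMS = {"E": 2.0, "A": 100.0, "B": 100.0, "alpha": 0.3, "beta": 0.3}


def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-style law: L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


def shared_exponent_loss(n_params, n_tokens, E, A, B, alpha, beta,
                         scale_n=1.0, scale_d=1.0):
    """Hypothetical shared-exponent variant (an assumption, not the paper's law).

    alpha and beta are shared by all optimizers; each optimizer only
    rescales the effective model size and the effective amount of data.
    """
    return (E
            + A / (scale_n * n_params)**alpha
            + B / (scale_d * n_tokens)**beta)


if __name__ == "__main__":
    n, d = 1e9, 2e10  # model parameters and training tokens (arbitrary)
    baseline = shared_exponent_loss(n, d, **PARAMS)
    rescaled = shared_exponent_loss(n, d, scale_n=1.5, scale_d=1.5, **PARAMS)
    print(f"baseline optimizer: {baseline:.4f}")
    print(f"rescaled optimizer: {rescaled:.4f}")
```

Under this assumed form, comparing two optimizers reduces to comparing their rescaling factors, since the exponents are the same for both.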

Related papers

Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages. We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across languages.
arXiv Detail & Related papers (2025-12-05T11:03:41Z)
Near-optimal Linear Predictive Clustering in Non-separable Spaces via Mixed Integer Programming and Quadratic Pseudo-Boolean Reductions [21.80447518126464]
Linear Predictive Clustering (LPC) partitions samples based on shared linear relationships between feature and target variables. Greedy optimization methods, commonly used for LPC, alternate between clustering and linear regression but lack global optimality. This work builds on the constrained optimization paradigm to introduce two novel approaches that improve the efficiency of global optimization for LPC.
arXiv Detail & Related papers (2025-11-13T21:22:47Z)
Bilevel Learning via Inexact Stochastic Gradient Descent [5.312803257246881]
Bilevel optimization is a central tool in machine learning for high-dimensional hyperparameter tuning. We advance the theory of inexact bilevel optimization. We prove convergence and establish rates under decaying accuracy and step size schedules.
arXiv Detail & Related papers (2025-11-10T07:02:52Z)
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization [32.97211471008323]
We introduce the first theoretical framework for the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and states. We show that both algorithms retain convergence rates close to their full-precision counterparts provided mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weight and second-moment quantization due to its reliance on $\beta \to 1$, while Muon requires weaker error control and is thus potentially more robust.
arXiv Detail & Related papers (2025-10-24T10:16:23Z)
Make Optimization Once and for All with Fine-grained Guidance [78.14885351827232]
Learning to Optimize (L2O) enhances optimization efficiency with integrated neural networks. L2O paradigms achieve great outcomes, e.g., refitting, generating unseen solutions iteratively or directly. Our analyses explore a general framework for learning optimization, called Diff-L2O, focusing on augmenting solutions from a wider view.
arXiv Detail & Related papers (2025-03-14T14:48:12Z)
Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling [27.058009599819012]
We study the connection between optimal learning rates and batch sizes for Adam-style optimizers. We prove that the optimal learning rate first rises and then falls as the batch size increases.
arXiv Detail & Related papers (2024-05-23T13:52:36Z)
Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator [75.05106948314956]
We show that an increasingly large momentum parameter for the first-order moment is sufficient for adaptive scaling. We also give insights for increasing the momentum in a stagewise manner in accordance with stagewise decreasing step size.
arXiv Detail & Related papers (2021-04-30T08:50:24Z)
Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time up to an error. Rather than fixing the minibatch and the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with Gaussian processes trained on few data points. The approach also leads to a significantly smaller and computationally cheaper sub-solver for lower bounding. In total, the proposed method reduces time to convergence by orders of magnitude.
arXiv Detail & Related papers (2020-05-21T20:59:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.