Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
- URL: http://arxiv.org/abs/2512.22382v1
- Date: Fri, 26 Dec 2025 20:56:04 GMT
- Title: Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
- Authors: Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi,
- Abstract summary: We show how to search for optimal global base hyperparameters at a small model size and transfer them to a large size. Our experiments demonstrate significant training speed improvements in Large Language Models.
- Score: 40.02031646222292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $μ$P, have enabled transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large size. We extend these works in two key ways. To handle scaling along the most important axes, we propose the Complete$^{(d)}$ Parameterisation, which unifies scaling in width and depth -- using an adaptation of CompleteP -- as well as in batch size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.
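To make the transfer recipe concrete, below is a minimal, hypothetical sketch of $μ$P-style per-module learning-rate transfer across width. The module names, base values, and the 1/width scaling rule are illustrative assumptions for exposition only; they are not the paper's Complete$^{(d)}$ parameterisation, which additionally handles depth, batch size, and training duration.

```python
# Minimal sketch of muP-style hyperparameter transfer across width.
# Illustrative only: module names, base values, and scaling rules are
# assumptions, not the paper's Complete^(d) parameterisation.

def transferred_learning_rates(base_lrs, base_width, target_width):
    """Scale per-module base learning rates tuned at `base_width`
    to a larger `target_width` under a muP-like rule:
    matrix-like (hidden) parameters scale as 1 / width multiplier,
    embedding-like parameters keep their base learning rate."""
    m = target_width / base_width  # width multiplier
    rules = {
        "embedding": 1.0,       # vector-like: learning rate unchanged
        "attention": 1.0 / m,   # matrix-like: shrink with width
        "mlp": 1.0 / m,
        "unembedding": 1.0 / m,
    }
    return {name: lr * rules[name] for name, lr in base_lrs.items()}

# Example: hyperparameters tuned on a width-256 proxy, transferred to width 4096.
base_lrs = {"embedding": 3e-3, "attention": 3e-3, "mlp": 3e-3, "unembedding": 3e-3}
print(transferred_learning_rates(base_lrs, base_width=256, target_width=4096))
```

In the per-module regime studied by the paper, each module's base hyperparameters would first be tuned separately at the small proxy size, and the parameterisation's scaling rules would then carry them to the large model.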
Related papers
- Hyperparameter Transfer with Mixture-of-Expert Layers [51.03005470884366]
Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks. We propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size.
arXiv Detail & Related papers (2026-01-28T03:02:30Z) - High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning [57.85676271833619]
Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full-parameter fine-tuning. We present SMoA, a high-rank Structured MOdulation Adapter that uses fewer trainable parameters while maintaining a higher rank.
arXiv Detail & Related papers (2026-01-12T13:06:17Z) - Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [59.369484219304866]
We conduct an unprecedented empirical investigation, training over 3,700 Large Language Models (LLMs) from scratch across 100 trillion tokens. We establish a universal Scaling Law for hyperparameter optimization in LLM pre-training, called Step Law. Our estimated optima deviate from the global best performance found via exhaustive search by merely 0.094% on the test set.
arXiv Detail & Related papers (2025-03-06T18:58:29Z) - Efficient Hyperparameter Importance Assessment for CNNs [1.7778609937758323]
This paper aims to quantify the importance weights of some hyperparameters in Convolutional Neural Networks (CNNs) with an algorithm called N-RReliefF.
We conduct an extensive study by training over ten thousand CNN models across ten popular image classification datasets.
arXiv Detail & Related papers (2024-10-11T15:47:46Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers and four parameterizations.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections [59.839926875976225]
We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections.
In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters.
arXiv Detail & Related papers (2024-05-30T17:26:02Z) - On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width [5.217870815854702]
We identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner.
Inspired by a maximal update parameterization, we consider a one-step update of the gradient.
Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo.
arXiv Detail & Related papers (2023-12-19T15:12:39Z) - AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient
Hyper-parameter Tuning [72.54359545547904]
We propose a gradient-based subset selection framework for hyper-parameter tuning.
We show that using gradient-based data subsets for hyper-parameter tuning achieves significantly faster turnaround times and speedups of 3$\times$-30$\times$.
arXiv Detail & Related papers (2022-03-15T19:25:01Z)