Hyperparameter Transfer with Mixture-of-Expert Layers
- URL: http://arxiv.org/abs/2601.20205v1
- Date: Wed, 28 Jan 2026 03:02:30 GMT
- Title: Hyperparameter Transfer with Mixture-of-Expert Layers
- Authors: Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin
- Abstract summary: Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks. We propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.
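The abstract does not give the paper's exact scaling rules, but the idea of width-aware HP transfer can be illustrated with a minimal sketch in the style of standard muP: per-parameter-group learning rates and initialization scales are rescaled with a width multiplier so that a base learning rate tuned on a small model remains usable at a larger width. The function name, the treatment of the router as a readout-like layer, and the specific scaling exponents below are illustrative assumptions, not the paper's parameterization.

```python
# Hypothetical muP-style scaling for an MoE transformer layer.
# Assumptions (NOT from the paper): hidden matrices use lr ~ 1/width and
# init std ~ 1/sqrt(fan_in); embeddings keep a width-independent lr; the
# router is treated like an output layer and also scaled by 1/width.

def mup_scaled_hparams(base_lr, base_width, width, num_experts):
    """Rescale per-group learning rates and init stds when widening the model."""
    m = width / base_width  # width multiplier relative to the tuned base model
    return {
        "hidden_lr": base_lr / m,                 # matrix-like params: lr shrinks with width
        "embedding_lr": base_lr,                  # input embeddings: lr held constant
        "router_lr": base_lr / m,                 # router as readout-like layer (assumption)
        "hidden_init_std": (1.0 / width) ** 0.5,  # variance ~ 1/fan_in
        "router_init_std": (1.0 / width) ** 0.5,
        "num_experts": num_experts,
    }

# A base lr swept at width 256 is mapped to a width-2048 model (8x wider):
base = mup_scaled_hparams(base_lr=1e-2, base_width=256, width=256, num_experts=8)
big = mup_scaled_hparams(base_lr=1e-2, base_width=256, width=2048, num_experts=64)
print(base["hidden_lr"], big["hidden_lr"])  # hidden lr shrinks by the width ratio
```

The point of such a parameterization is that the *base* learning rate becomes approximately width-independent, so it can be tuned once on a cheap small model and transferred; the paper extends this style of analysis (via DMFT) to the MoE-specific dimensions of expert count and expert size.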
Related papers
- High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning [57.85676271833619]
Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning. We present SMoA, a high-rank Structured MOdulation Adapter that uses fewer trainable parameters while maintaining a higher rank.
arXiv Detail & Related papers (2026-01-12T13:06:17Z) - Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration [40.02031646222292]
We show how to search for optimal global base hyperparameters at a small model size, and transfer to a large size. Our experiments demonstrate significant training speed improvements in Large Language Models.
arXiv Detail & Related papers (2025-12-26T20:56:04Z) - $μ$-Parametrization for Mixture of Experts [8.950722808523981]
Mixture-of-Experts (MoE) models are emerging as a leading architecture in extremely large models. $μ$Transfer allows seamless transfer of optimal hyperparameters across model scales. Experiments demonstrate that the optimal learning rate reliably transfers across model sizes.
arXiv Detail & Related papers (2025-08-13T12:31:27Z) - Sparsity May Be All You Need: Sparse Random Parameter Adaptation [7.479026959617763]
Full fine-tuning of large language models for alignment and task adaptation has become prohibitively expensive as models have grown in size. We propose a novel way to reduce the computational and memory resources needed for fine-tuning these models by only training on a small number of parameters instead of all model parameters. Our findings suggest that what truly matters for a PEFT technique to perform well is not necessarily the specific adapter structure, but rather the number of trainable parameters being used.
arXiv Detail & Related papers (2025-02-21T22:23:16Z) - QuIC: Quantum-Inspired Compound Adapters for Parameter Efficient Fine-Tuning [0.0]
Scaling full finetuning of large foundation models strains GPU memory and training time. We introduce Quantum-Inspired Compound Adapters (QuIC Adapters), which can effectively finetune a model using less than 0.02% of the memory footprint of the base model.
arXiv Detail & Related papers (2025-02-10T13:06:56Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers and four parameterizations across many learning rates and model sizes.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model [81.55141188169621]
We equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios.
We propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer.
Our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.
arXiv Detail & Related papers (2023-11-28T11:23:34Z) - Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.