Continual Pre-training of MoEs: How robust is your router?
- URL: http://arxiv.org/abs/2503.05029v2
- Date: Mon, 10 Nov 2025 05:32:48 GMT
- Title: Continual Pre-training of MoEs: How robust is your router?
- Authors: Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish
- Abstract summary: Mixture of Experts (MoE) transformers are promising architectures for foundation models. MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. We show that MoEs can match the performance of a fully re-trained MoE at a fraction of the cost.
- Score: 31.784662011106196
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating-point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay, learning rate re-warming, and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) do the MoE transformer's routers exacerbate forgetting relative to a dense model?; 2) do the routers maintain a balanced load on previous distributions after CPT?; 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale study training a 500M parameter dense transformer and four 500M-active/2B-total parameter MoE transformers. Each model is trained for 600B tokens. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
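As a reference point for the "Z-and-Aux-loss-balanced" routing named in the abstract, below is a minimal PyTorch sketch (not the authors' implementation) of a top-k router regularized with a router z-loss and a Switch-style load-balancing auxiliary loss; the layer sizes and loss coefficients are illustrative assumptions.

```python
# Minimal sketch of a top-k MoE router with z-loss and auxiliary load-balancing loss.
# Illustrative only; not the paper's code. Coefficients are assumed values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int = 2,
                 z_loss_coef: float = 1e-3, aux_loss_coef: float = 1e-2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k, self.n_experts = k, n_experts
        self.z_loss_coef, self.aux_loss_coef = z_loss_coef, aux_loss_coef

    def forward(self, x):                                  # x: (tokens, d_model)
        logits = self.gate(x)                              # (tokens, n_experts)
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)

        # Router z-loss: penalizes large router logits to keep the gate numerically stable.
        z_loss = (torch.logsumexp(logits, dim=-1) ** 2).mean()

        # Switch-style auxiliary loss: pushes the token-to-expert assignment toward uniform.
        # (Uses only the top-1 assignment for the dispatch fraction, as a simplification.)
        dispatch = F.one_hot(topk_idx[..., 0], self.n_experts).float()
        tokens_per_expert = dispatch.mean(dim=0)           # fraction of tokens per expert
        router_prob_per_expert = probs.mean(dim=0)         # mean gate probability per expert
        aux_loss = self.n_experts * (tokens_per_expert * router_prob_per_expert).sum()

        balance_loss = self.z_loss_coef * z_loss + self.aux_loss_coef * aux_loss
        return topk_probs, topk_idx, balance_loss
```

Roughly speaking, the Sinkhorn-balanced alternative studied in the paper replaces these auxiliary losses with a Sinkhorn normalization of the routing scores to balance the assignment; it is not shown here.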
Related papers
- MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production [25.988476402301277]
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MegaScale-MoE customizes communication-efficient strategies for attention and FFNs in each MoE layer. MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM.
arXiv Detail & Related papers (2025-05-16T16:52:16Z) - Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z) - Dense Backpropagation Improves Training for Sparse Mixture-of-Experts [41.08173926456885]
We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters.
Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead.
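The summary does not spell out the mechanism; one plausible reading (our assumption, not necessarily the paper's exact method) is to fill in the outputs of unactivated experts with a cached running-mean "default" output, so every gate probability receives a gradient while only the top-k experts are actually evaluated. A rough sketch:

```python
# Rough sketch (assumed mechanism, not the paper's code) of a dense router gradient:
# experts outside the top-k contribute a cached running-mean output in the combine step.
import torch
import torch.nn as nn

class DenseGradMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=8, k=2, ema=0.99):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k, self.ema = k, ema
        # Cached per-expert mean output, standing in for experts that were not activated.
        self.register_buffer("default_out", torch.zeros(n_experts, d_model))

    def forward(self, x):                                  # x: (tokens, d_model)
        probs = self.gate(x).softmax(dim=-1)               # (tokens, n_experts)
        _, topk_idx = probs.topk(self.k, dim=-1)           # (tokens, k)
        expert_out = self.default_out.expand(x.size(0), -1, -1).clone()
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)             # tokens routed to expert e
            if mask.any():
                out = expert(x[mask])
                expert_out[mask, e] = out                  # overwrite defaults where activated
                with torch.no_grad():                      # update the running-mean default
                    self.default_out[e].mul_(self.ema).add_((1 - self.ema) * out.mean(0))
        # Dense combine: every gate probability participates, so the router's gradient
        # is dense even though only the top-k experts were evaluated.
        return torch.einsum("te,ted->td", probs, expert_out)
```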
arXiv Detail & Related papers (2025-04-16T19:55:36Z) - ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token.
We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
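To make the decomposition concrete, here is an illustrative simplification (not ResMoE itself): a plain parameter mean stands in for the Wasserstein barycenter, and each expert is stored as that shared expert plus a low-rank residual.

```python
# Illustrative simplification of a "common expert + compressed residuals" decomposition.
# ResMoE uses a Wasserstein barycenter; a plain mean is substituted here for brevity.
import torch

def compress_experts(expert_weights, rank=16):
    """expert_weights: (n_experts, d_out, d_in) stacked expert matrices."""
    barycenter = expert_weights.mean(dim=0)                # shared "barycenter" expert (simplified)
    residuals = expert_weights - barycenter                # per-expert deviation
    factors = []
    for r in residuals:                                    # truncated SVD per residual
        u, s, vh = torch.linalg.svd(r, full_matrices=False)
        factors.append((u[:, :rank] * s[:rank], vh[:rank]))
    return barycenter, factors                             # storage ~ one full matrix + low-rank terms

def reconstruct_expert(barycenter, factors, i):
    us, vh = factors[i]
    return barycenter + us @ vh                            # approximate original expert i
```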
arXiv Detail & Related papers (2025-03-10T03:15:54Z) - Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining [32.925150708409205]
Mixed Sparsity Training (MST) is an efficient pretraining method that can reduce Floating Point Operations (FLOPs) by about $75\%$ while maintaining performance.
Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.
arXiv Detail & Related papers (2024-08-21T16:13:16Z) - LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z) - FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation [32.01836613286288]
This work presents a Fully BInarized Large Language Model (FBI-LLM).
It demonstrates for the first time how to train a large-scale binary language model from scratch.
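The summary does not describe the binarization scheme; the sketch below shows a generic 1-bit weight quantization (XNOR-Net-style sign plus scale) only to illustrate what "fully binarized" means, and is not FBI-LLM's exact method or its distillation objective.

```python
# Generic 1-bit weight quantization sketch (sign + per-row scale); not FBI-LLM's scheme.
import torch

def binarize(weight: torch.Tensor):
    scale = weight.abs().mean(dim=1, keepdim=True)   # per-row scaling factor
    return torch.sign(weight) * scale                # each weight becomes -scale or +scale

w = torch.randn(8, 16)
w_bin = binarize(w)                                  # only two distinct magnitudes per row remain
```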
arXiv Detail & Related papers (2024-07-09T17:59:48Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
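A minimal sketch of our reading of "dense training, sparse inference" (not the DS-MoE code): all experts are evaluated and combined during training, while at inference only the top-k gate weights are kept. For brevity the sketch still runs every expert in eval mode and merely zeroes the others' weights; a real implementation would skip them entirely.

```python
# Sketch of dense training / sparse inference for an MoE layer; sizes are illustrative.
import torch
import torch.nn as nn

class DSMoELayer(nn.Module):
    def __init__(self, d_model=256, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                                    # x: (tokens, d_model)
        probs = self.gate(x).softmax(dim=-1)                 # (tokens, n_experts)
        if not self.training:                                # sparse inference: keep top-k only
            topk_vals, topk_idx = probs.topk(self.k, dim=-1)
            probs = torch.zeros_like(probs).scatter(-1, topk_idx, topk_vals)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # dense training: all experts run
        return torch.einsum("te,ted->td", probs, outs)
```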
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Approximating Two-Layer Feedforward Networks for Efficient Transformers [15.793406740545024]
We present a general framework that unifies various methods to approximate two-layer NNs, including product-key memories (PKMs).
We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales.
This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs.
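A small sketch of the unifying view as we paraphrase it: a two-layer FFN acts as a key-value memory, and sparse variants (MoEs, PKMs, top-k activation) only evaluate the value vectors whose key activations are largest. Names and sizes below are illustrative assumptions.

```python
# Treating a two-layer FFN as a key-value memory and approximating it with top-k activation.
import torch

def topk_ffn(x, W_key, W_val, k=32):
    """x: (d,), W_key: (n_hidden, d) 'keys', W_val: (n_hidden, d) 'values'."""
    scores = torch.relu(W_key @ x)               # hidden activations = key matches
    vals, idx = scores.topk(k)                   # keep only the k strongest matches
    return vals @ W_val[idx]                     # weighted sum of the selected value vectors

x = torch.randn(64)
W_key, W_val = torch.randn(1024, 64), torch.randn(1024, 64)
y = topk_ffn(x, W_key, W_val)                    # approximates relu(W_key @ x) @ W_val
```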
arXiv Detail & Related papers (2023-10-16T21:23:16Z) - MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
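A hedged sketch of the nested-FFN idea as we read the summary: smaller submodels reuse a prefix of the full FFN's hidden units, so one set of weights yields several model sizes. The dimensions are illustrative, not MatLM's.

```python
# Nested FFN sketch: submodels are extracted by slicing a prefix of the hidden dimension.
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x, frac=1.0):
        h = int(self.w_in.out_features * frac)    # width of the extracted submodel
        hidden = torch.relu(x @ self.w_in.weight[:h].T + self.w_in.bias[:h])
        return hidden @ self.w_out.weight[:, :h].T + self.w_out.bias

ffn = NestedFFN()
x = torch.randn(4, 512)
full, half = ffn(x, frac=1.0), ffn(x, frac=0.5)   # same weights, two "model sizes"
```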
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - Residual Mixture of Experts [75.5489156421442]
Residual Mixture of Experts (RMoE) is an efficient training pipeline for MoE vision transformers on downstream tasks.
RMoE achieves comparable results with the upper-bound MoE training, while only introducing minor additional training cost.
arXiv Detail & Related papers (2022-04-20T17:29:48Z) - StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z) - Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
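A minimal sketch of the routing rule as the summary states it (no learned router): each input is sent to a uniformly random expert, during training and inference alike. Anything beyond that (e.g., THOR's training objective) is not reproduced here, and the layer sizes are illustrative.

```python
# Random expert activation per token, with no learned routing function.
import torch
import torch.nn as nn

class RandomExpertLayer(nn.Module):
    def __init__(self, d_model=256, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                      # x: (tokens, d_model)
        idx = torch.randint(len(self.experts), (x.size(0),))   # random expert per token
        out = torch.empty_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```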
arXiv Detail & Related papers (2021-10-08T17:15:47Z) - Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)