Related papers: LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training

URL: http://arxiv.org/abs/2602.11686v1
Date: Thu, 12 Feb 2026 08:08:03 GMT
Title: LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
Authors: Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, Bin Cui
Abstract summary: In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices. We perform experiments on an A100 cluster, and the results indicate that our system achieves up to 1.69x acceleration compared to the current state-of-the-art training systems.
Score: 27.022187489292467
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models, enabling different devices to host distinct experts, with each device processing different input data. However, during expert parallel training, dynamic routing results in significant load imbalance among experts: a handful of overloaded experts hinder overall iteration, emerging as a training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices and restores partial experts at expert granularity through All-to-All communication during training. This allows for flexible re-layout of expert parameters during training to enhance load balancing. In particular, we perform fine-grained scheduling of communication operations to minimize communication overhead. Additionally, we develop a load balancing planner to formulate re-layout strategies of experts and routing schemes for tokens during training. We perform experiments on an A100 cluster, and the results indicate that our system achieves up to 1.69x acceleration compared to the current state-of-the-art training systems. Source code available at https://github.com/PKU-DAIR/Hetu-Galvatron/tree/laer-moe.
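
Illustration (not from the paper): the load imbalance described in the abstract, and the general idea of re-laying out experts across devices, can be sketched with a toy example. The token counts, the function names, and the greedy heaviest-first placement below are all assumptions made for illustration; this is a generic heuristic and not the LAER-MoE planner or FSEP itself. The sketch assumes the number of experts is divisible by the number of devices.

```python
import heapq


def device_loads(expert_tokens, layout, num_devices):
    """Sum the token counts of the experts hosted on each device."""
    loads = [0] * num_devices
    for expert, device in enumerate(layout):
        loads[device] += expert_tokens[expert]
    return loads


def static_layout(num_experts, num_devices):
    """Contiguous placement: the first k experts on device 0, the next k on device 1, ..."""
    per_device = num_experts // num_devices
    return [e // per_device for e in range(num_experts)]


def greedy_relayout(expert_tokens, num_devices):
    """Hypothetical heuristic: place the heaviest experts first, each on the
    least-loaded device that still has a free expert slot."""
    num_experts = len(expert_tokens)
    slots = num_experts // num_devices
    # Heap entries are (current load, device id, experts already placed).
    heap = [(0, d, 0) for d in range(num_devices)]
    heapq.heapify(heap)
    layout = [None] * num_experts
    order = sorted(range(num_experts), key=lambda e: -expert_tokens[e])
    for e in order:
        load, d, used = heapq.heappop(heap)
        layout[e] = d
        # Only keep the device in the heap while it has free slots.
        if used + 1 < slots:
            heapq.heappush(heap, (load + expert_tokens[e], d, used + 1))
    return layout


# Made-up routing result: two "hot" experts receive most of the tokens.
tokens = [900, 700, 100, 100, 50, 50, 50, 50]
static = device_loads(tokens, static_layout(len(tokens), 4), 4)
relaid = device_loads(tokens, greedy_relayout(tokens, 4), 4)
print("static layout loads:", static, "max:", max(static))
print("re-laid-out loads:  ", relaid, "max:", max(relaid))
```

In this toy case the busiest device handles fewer tokens after re-layout, but the loads are still far from even, because moving whole experts cannot split a single hot expert. The abstract describes the planner as also choosing routing schemes for tokens; the sketch does not model that.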

Related papers

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models [28.87682703032017]
Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths.
arXiv Detail & Related papers (2026-02-05T19:48:41Z)
SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning [83.66308307152808]
We propose StAbilized Mixture-of-Experts (SAME) for Multimodal Continual Instruction Tuning (MCIT). SAME stabilizes expert selection by decomposing routing dynamics into subspaces and updating only task-relevant directions. It also introduces adaptive expert activation to freeze selected experts during training, reducing redundant and cross-task interference.
arXiv Detail & Related papers (2026-02-02T11:47:06Z)
Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
arXiv Detail & Related papers (2026-01-23T18:19:15Z)
Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization [60.309915093470416]
Matryoshka MoE (M-MoE) is a training framework that instills a coarse-to-fine structure directly into the expert ensemble. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
arXiv Detail & Related papers (2025-09-30T16:56:44Z)
TT-LoRA MoE: Unifying Parameter-Efficient Fine-Tuning and Sparse Mixture-of-Experts [4.5558042369389105]
TT-LoRA MoE decomposes training into two distinct optimized stages. First, we independently train lightweight, tensorized low-rank adapters (TT-LoRA experts). Subsequently, these expert adapters remain frozen, eliminating inter-task interference and forgetting in multi-task settings. A sparse MoE router, trained separately, dynamically leverages base model representations to select exactly one specialized adapter per input at inference time. Comprehensive experiments confirm our architecture retains the memory efficiency of low-rank adapters, seamlessly scales to large expert pools, and achieves robust task-level optimization.
arXiv Detail & Related papers (2025-04-29T21:46:43Z)
SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling [1.2777855412373709]
Mixture-of-Experts (MoE) models have become a widely-adopted solution to continue scaling model sizes without a corresponding linear increase in compute. Current systems are forced to either drop tokens assigned to popular experts, degrading convergence, or frequently rebalance resources allocated to each expert based on popularity. We introduce SYMI, an adaptive MoE training system.
arXiv Detail & Related papers (2025-04-28T15:58:55Z)
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [81.18305296110853]
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
arXiv Detail & Related papers (2024-03-12T16:54:58Z)
Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism [91.9372563527801]
Existing MoE models suffer from tremendous inner-node and inter-node communication overhead. We propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them. PPMoE builds expert parallel incorporating with tensor parallel and replaces communication-intensive all-to-all dispatching and gathering.
arXiv Detail & Related papers (2023-04-22T14:09:14Z)
BASE Layers: Simplifying Training of Large, Sparse Models [53.98145464002843]
We introduce a new balanced assignment of experts (BASE) layer for large language models. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules. We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
arXiv Detail & Related papers (2021-03-30T23:08:32Z)
