HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
- URL: http://arxiv.org/abs/2508.09591v1
- Date: Wed, 13 Aug 2025 08:16:31 GMT
- Title: HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
- Authors: Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
- Abstract summary: We introduce HierMoE to accelerate the training of large language models (LLMs) with two topology-aware techniques. Our prototype HierMoE achieves $1.55\times$ to $3.32\times$ faster communication and delivers $1.18\times$ to $1.27\times$ faster end-to-end training.
- Score: 17.1806530983927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) because its sparsity reduces computational demands while allowing the model size to scale easily. In MoE models, each MoE layer must dynamically route tokens to particular experts for computation, and the activated experts may not reside on the same device or GPU as the token. This leads to substantial communication overhead and load imbalance across GPUs, which limits the scalability of distributed training on a GPU cluster. To this end, we introduce HierMoE to accelerate the training of MoE models with two topology-aware techniques: 1) token deduplication to reduce communication traffic, and 2) expert swap to balance the workloads among all GPUs. To make the two proposed approaches more general, we build theoretical models that derive the best token deduplication and expert swap strategies under different model configurations and hardware environments. We implement our prototype HierMoE system atop Megatron-LM and conduct experiments on a 32-GPU cluster with the DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results show that HierMoE achieves $1.55\times$ to $3.32\times$ faster communication and delivers $1.18\times$ to $1.27\times$ faster end-to-end training compared to state-of-the-art MoE training systems, Tutel-2DH, SmartMoE, and Megatron-LM.
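The token-deduplication idea in the abstract lends itself to a short illustration: a token routed to multiple experts hosted on the same remote node only needs to cross the inter-node link once, and can be re-expanded into per-expert copies after it arrives. The PyTorch sketch below is a minimal, hypothetical rendering of that grouping step; the function and variable names are illustrative assumptions, not HierMoE's actual API.

```python
# Minimal sketch of node-level token deduplication before an inter-node
# all-to-all dispatch. Names (dedup_dispatch, expert_to_node, ...) are
# illustrative, not taken from HierMoE's implementation.
import torch

def dedup_dispatch(tokens, expert_ids, expert_to_node, num_nodes):
    """Group tokens by destination node, sending each (token, node) pair once.

    tokens:         [num_tokens, hidden] activations on the local rank
    expert_ids:     [num_tokens, top_k] experts chosen by the router
    expert_to_node: [num_experts] node hosting each expert
    Returns per-node payloads plus the bookkeeping needed to re-expand each
    token into per-expert copies on the receiving node.
    """
    dest_nodes = expert_to_node[expert_ids]               # [num_tokens, top_k]
    payloads, expand_maps = [], []
    for node in range(num_nodes):
        # Tokens with at least one target expert on this node.
        mask = (dest_nodes == node).any(dim=1)             # [num_tokens]
        idx = mask.nonzero(as_tuple=True)[0]
        payloads.append(tokens[idx])                       # sent once per node
        # Record which of each token's top-k experts live on this node, so the
        # receiver can duplicate the token locally (a cheap intra-node copy).
        expand_maps.append((idx, dest_nodes[idx] == node))
    return payloads, expand_maps

# Example: 4 tokens, top-2 routing, 4 experts spread over 2 nodes.
tokens = torch.randn(4, 8)
expert_ids = torch.tensor([[0, 1], [0, 2], [2, 3], [1, 3]])
expert_to_node = torch.tensor([0, 0, 1, 1])
payloads, _ = dedup_dispatch(tokens, expert_ids, expert_to_node, num_nodes=2)
# Token 0 targets experts 0 and 1 (both on node 0) but appears once in payloads[0].
```

The trade-off is a cheap intra-node copy in exchange for fewer bytes crossing the slower inter-node fabric, which is where the paper's hierarchical, topology-aware framing applies.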
Related papers
- MiMo-V2-Flash Technical Report [101.08860732385551]
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k.
arXiv Detail & Related papers (2026-01-06T07:31:47Z) - SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation [82.53411922988039]
We introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters). Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models.
arXiv Detail & Related papers (2025-06-23T07:15:59Z) - Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity [105.54207710201183]
MoGE constrains tokens to activate an equal number of experts within each predefined expert group. Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration.
arXiv Detail & Related papers (2025-05-27T16:40:21Z) - Balanced and Elastic End-to-end Training of Dynamic LLMs [2.7461964910607097]
We propose an autonomous dynamic load balancing solution, DynMo, for large-scale distributed training. DynMo provably achieves maximum reduction in workload imbalance and adaptively equalizes compute loads across workers. Compared to static distributed training solutions such as Megatron-LM and DeepSpeed, DynMo accelerates the end-to-end training of dynamic GPT models by up to 1.23x for MoEs, 3.18x for parameter pruning, 2.23x for layer freezing, 4.02x for sparse attention, 4.52x for early exit, and 1.17x for MoDs.
arXiv Detail & Related papers (2025-05-20T19:52:57Z) - FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models [21.96960353910023]
We introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations.
arXiv Detail & Related papers (2025-01-18T10:14:37Z) - 2 OLMo 2 Furious [154.15728448754854]
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales. We describe our modified model architecture and training recipe.
arXiv Detail & Related papers (2024-12-31T21:55:10Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves $1.13\times$ to $5.77\times$ speedup on 1296 manually configured MoE layers and approximately $3\times$ improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z) - A Closer Look into Mixture-of-Experts in Large Language Models [26.503570706063634]
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance. The MoE architecture can increase the model size without sacrificing computational efficiency. We make an initial attempt to understand the inner workings of MoE-based large language models.
arXiv Detail & Related papers (2024-06-26T10:07:57Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism [91.9372563527801]
Existing MoE models suffer from tremendous intra-node and inter-node communication overhead.
We propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them.
PPMoE combines expert parallelism with tensor parallelism and replaces the communication-intensive all-to-all dispatching and gathering.
arXiv Detail & Related papers (2023-04-22T14:09:14Z) - MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [32.278096820269816]
We present a novel MoESys that boosts efficiency in both large-scale training and inference.
Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
arXiv Detail & Related papers (2022-05-20T09:09:27Z) - FastMoE: A Fast Mixture-of-Expert Training System [20.74001755688784]
Mixture-of-Experts (MoE) presents strong potential for enlarging language models to trillions of parameters.
FastMoE is a distributed MoE training system based on PyTorch with common accelerators.
arXiv Detail & Related papers (2021-03-24T15:27:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.