Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism
- URL: http://arxiv.org/abs/2602.04870v1
- Date: Wed, 04 Feb 2026 18:57:19 GMT
- Title: Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism
- Authors: Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner
- Abstract summary: Multi-Head LatentMoE and Head Parallel achieve $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having identical performance.
- Score: 7.862911132148511
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being $1.11\times$ faster. Our method makes multi-billion-parameter foundation model research more accessible.
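To make the communication claim concrete, the back-of-the-envelope sketch below contrasts per-token traffic under Expert Parallel, which dispatches each token's activation to its $k$ activated experts and gathers the results back, with a fixed-size per-token exchange of the kind Head Parallel is described as providing. The latent width, the dispatch-plus-combine factor of two, and the numeric sizes are illustrative assumptions, not values from the paper; the abstract states only that HP's cost is $O(1)$ in $k$, fully balanced, and deterministic.

```python
# Hypothetical comparison of per-token communication volume under Expert
# Parallel (EP) versus a fixed-size latent exchange such as Head Parallel (HP)
# is described as providing. The latent width d_latent and the exact traffic
# pattern are assumptions for illustration only.

def ep_bytes_per_token(d_model: int, k: int, bytes_per_elem: int = 2) -> int:
    """EP dispatches each token's d_model-wide activation to k experts
    (and gathers k results back), so traffic grows linearly with k."""
    return 2 * k * d_model * bytes_per_elem  # dispatch + combine

def hp_bytes_per_token(d_latent: int, bytes_per_elem: int = 2) -> int:
    """A fixed-size latent exchange moves one d_latent-wide vector per token
    regardless of how many experts k are activated."""
    return 2 * d_latent * bytes_per_elem

if __name__ == "__main__":
    d_model, d_latent = 4096, 1024  # illustrative sizes, not from the paper
    for k in (2, 4, 8):
        print(f"k={k}: EP {ep_bytes_per_token(d_model, k)} B/token "
              f"vs. fixed latent exchange {hp_bytes_per_token(d_latent)} B/token")
```

As $k$ grows, the EP term scales with $k$ while the latent exchange stays constant, which is the gap the abstract's $O(1)$ claim refers to.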
Related papers
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping [52.02659589971978]
We propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. MoDES significantly enhances inference speed, improving the prefilling time by $2.16\times$ and the decoding time by $1.26\times$.
arXiv Detail & Related papers (2025-11-19T18:48:27Z)
- Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference [77.07591324890537]
We propose system- and algorithm-level innovations to reduce communication costs. We show that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. Our designs are capable of either delivering exact results with reduced communication cost or controllably minimizing the cost with collaboration pruning.
arXiv Detail & Related papers (2025-05-19T16:50:27Z)
- Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling [3.529891364583952]
MoE (Mixture of Experts) prevails as a neural architecture that can scale modern transformer-based LLMs (Large Language Models) to unprecedented scales. DeepSpeed-MoE, one state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm including EP (Expert Parallelism), TP (Tensor Parallelism), and DP (Data Parallelism). Our work aims to boost DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE.
arXiv Detail & Related papers (2025-03-06T12:52:22Z)
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [8.80408909878008]
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters. Existing methods pipeline the communication in an MoE layer with the computation so that the two overlap (a generic sketch of this overlap pattern appears after this list). We present COMET, an optimized MoE system with fine-grained communication-computation overlapping.
arXiv Detail & Related papers (2025-02-27T06:36:45Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves $1.13\times$ to $5.77\times$ speedup on 1296 manually configured MoE layers and approximately $3\times$ improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding scheme that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to $2.49\times$ speedup and maintains a minimal memory overhead of just $0.0004\%$.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts [4.629608387540524]
ScMoE is a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy. Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of $1.49\times$ in training and $1.82\times$ in inference.
arXiv Detail & Related papers (2024-04-07T17:17:23Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over an 80% reduction in computation, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT through a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism [91.9372563527801]
Existing MoE models suffer from tremendous intra-node and inter-node communication overhead.
We propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them.
PPMoE builds expert parallelism on top of tensor parallelism and replaces the communication-intensive all-to-all dispatching and gathering.
arXiv Detail & Related papers (2023-04-22T14:09:14Z)
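Several of the systems listed above (Comet, ScMoE, Parm, EPS-MoE) revolve around overlapping an MoE layer's all-to-all communication with expert computation. The sketch below is a generic, hypothetical illustration of that pattern using `torch.distributed`; the chunking scheme, function names, and equal-split assumption are illustrative, and none of the cited systems is reduced to exactly this code.

```python
# Generic sketch of computation-communication overlap for an MoE layer:
# the token batch is split into equal-sized chunks, and the all-to-all for
# chunk i+1 is launched asynchronously while the local experts process
# chunk i. Requires torch.distributed.init_process_group() beforehand.
import torch
import torch.distributed as dist

def overlapped_moe_dispatch(chunks, expert_fn):
    """chunks: list of [tokens, d_model] tensors, already permuted so an
    equal-split all-to-all routes each token to the rank holding its expert.
    expert_fn: callable applying the local experts to one dispatched chunk."""
    outputs = []
    recv = torch.empty_like(chunks[0])
    work = dist.all_to_all_single(recv, chunks[0], async_op=True)
    for i in range(len(chunks)):
        work.wait()                          # chunk i has arrived
        current = recv
        if i + 1 < len(chunks):              # prefetch chunk i+1
            recv = torch.empty_like(chunks[i + 1])
            work = dist.all_to_all_single(recv, chunks[i + 1], async_op=True)
        outputs.append(expert_fn(current))   # expert compute overlaps the prefetch
    # The combine (return) all-to-all is omitted for brevity; it can be
    # overlapped the same way in the reverse direction.
    return torch.cat(outputs, dim=0)
```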