FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
- URL: http://arxiv.org/abs/2501.10714v1
- Date: Sat, 18 Jan 2025 10:14:37 GMT
- Title: FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
- Authors: Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu,
- Abstract summary: We introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques.
We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters.
FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations.
- Score: 21.96960353910023
- License:
- Abstract: Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42$\times$ speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18$\times$-1.22$\times$ on 1458 MoE layers and 1.19$\times$-3.01$\times$ on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.
Related papers
- mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training [21.5342833453912]
Mixture-of-Expert (MoE) models outperform conventional models by selectively activating differents, named emphexperts, on a per-token basis.
This gated computation generates dynamic communications that cannot be determined beforehand, challenging the existing GPU interconnects that remain emphstatic during the distributed training process.
We advocate for a first-of-its-kind system, called mFabric, that unlocks topology reconfiguration emphduring distributed MoE training.
arXiv Detail & Related papers (2025-01-07T16:19:40Z) - Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes.
Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts [95.26323548734692]
MoMa is a modality-aware mixture-of-experts architecture for pre-training mixed-modal, early-fusion language models.
Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings.
arXiv Detail & Related papers (2024-07-31T17:46:51Z) - Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves 1.13$times$ to 5.77$times$ speedup on 1296 manually configured MoE layers and approximately 3$times$ improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism [91.9372563527801]
Existing MoE models suffer from tremendous inner-node and inter-node communication overhead.
We propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them.
PPMoE builds expert parallel incorporating with tensor parallel and replaces communication-intensive all-to-all dispatching and gathering.
arXiv Detail & Related papers (2023-04-22T14:09:14Z) - TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training [18.68993910156101]
We propose TA-MoE, a topology-aware routing strategy for large-scale MoE trainging.
We show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations.
arXiv Detail & Related papers (2023-02-20T11:18:24Z) - Tutel: Adaptive Mixture-of-Experts at Scale [20.036168971435306]
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost.
We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining.
Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture.
arXiv Detail & Related papers (2022-06-07T15:20:20Z) - Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.