TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
- URL: http://arxiv.org/abs/2302.09915v1
- Date: Mon, 20 Feb 2023 11:18:24 GMT
- Title: TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
- Authors: Chang Chen, Min Li, Zhihua Wu, Dianhai Yu, Chao Yang
- Abstract summary: We propose TA-MoE, a topology-aware routing strategy for large-scale MoE training.
We show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations.
- Score: 18.68993910156101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparsely gated Mixture-of-Expert (MoE) has demonstrated its effectiveness in
scaling up deep neural networks to an extreme scale. Despite that numerous
efforts have been made to improve the performance of MoE from the model design
or system optimization perspective, existing MoE dispatch patterns are still
not able to fully exploit the underlying heterogeneous network environments. In
this paper, we propose TA-MoE, a topology-aware routing strategy for
large-scale MoE training, from a model-system co-design perspective, which can
dynamically adjust the MoE dispatch pattern according to the network topology.
Based on communication modeling, we abstract the dispatch problem into an
optimization objective and obtain the approximate dispatch pattern under
different topologies. On top of that, we design a topology-aware auxiliary
loss, which can adaptively route the data to fit in the underlying topology
without sacrificing the model accuracy. Experiments show that TA-MoE can
substantially outperform its counterparts on various hardware and model
configurations, with roughly 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x speedups over
the popular DeepSpeed-MoE, FastMoE, and FasterMoE, respectively.
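To make the routing idea above concrete, the following is a minimal PyTorch-style sketch of what a topology-aware auxiliary loss could look like: a GShard-style load-balancing term in which the usual uniform target is replaced by per-expert dispatch shares derived offline from a communication model of the cluster topology. The `target_fraction` tensor and the exact weighting are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def topology_aware_aux_loss(gate_logits: torch.Tensor,
                            expert_indices: torch.Tensor,
                            target_fraction: torch.Tensor) -> torch.Tensor:
    """Sketch of a topology-aware auxiliary loss (illustrative, not TA-MoE's exact form).

    gate_logits:     [tokens, num_experts] raw router outputs
    expert_indices:  [tokens] top-1 expert chosen for each token
    target_fraction: [num_experts] desired dispatch share per expert, derived
                     offline from a communication/topology model; sums to 1
    """
    num_experts = gate_logits.shape[-1]
    probs = F.softmax(gate_logits, dim=-1)                      # [tokens, experts]
    dispatch = F.one_hot(expert_indices, num_experts).float()   # [tokens, experts]
    f = dispatch.mean(dim=0)   # fraction of tokens actually sent to each expert
    p = probs.mean(dim=0)      # mean router probability mass per expert
    # Dividing by the target share penalizes experts that exceed their
    # topology-derived budget; with uniform targets (1/num_experts) this
    # reduces to the standard num_experts * sum(f * p) balance loss.
    return torch.sum(f * p / target_fraction)
```

In practice such a term would be added to the task loss with a small coefficient, with `target_fraction` favoring experts reachable over fast intra-node links; TA-MoE's actual targets come from its communication-model optimization, which is not reproduced here.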
Related papers
- FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models [21.96960353910023]
We introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques.
We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters.
FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations.
arXiv Detail & Related papers (2025-01-18T10:14:37Z)
- Attention Is All You Need For Mixture-of-Depths Routing [5.419910566904439]
We introduce A-MoD, a novel attention-based routing mechanism for Mixture-of-Depths (MoD) models.
A-MoD allows for more efficient training as it introduces no additional trainable parameters.
It can increase the performance of the MoD model.
arXiv Detail & Related papers (2024-12-30T11:25:54Z)
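As a rough illustration of the attention-based routing idea summarized above, the sketch below scores each token by the mean attention it receives in the preceding layer and keeps only the top fraction of tokens for the Mixture-of-Depths block. The particular score and top-k selection are assumptions made for this example, not necessarily A-MoD's exact procedure.

```python
import torch

def attention_based_mod_routing(attn: torch.Tensor, capacity: float):
    """Route tokens for a Mixture-of-Depths block from an existing attention map.

    attn:     [batch, queries, keys] attention weights of the previous layer,
              already averaged over heads
    capacity: fraction of tokens the block should process (0 < capacity <= 1)

    A token's routing score is the average attention it receives from all
    queries, so no additional router parameters need to be trained.
    """
    _, _, seq_len = attn.shape
    scores = attn.mean(dim=1)                     # [batch, keys]: attention received
    k = max(1, int(capacity * seq_len))
    top = torch.topk(scores, k, dim=-1)
    return top.indices, top.values                # token ids to process and their scores
```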
- EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We set up a new frontier for 5M-magnitude lightweight models on various downstream tasks.
Our work rethinks the lightweight infrastructure of the efficient Inverted Residual Block (IRB) and practical components in the Transformer.
Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, we investigate the performance upper limit of lightweight models with a magnitude of 5M parameters.
arXiv Detail & Related papers (2024-12-09T17:12:22Z)
- Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and its efficient variant E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z)
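For intuition, here is a hedged sketch of a weight-ensembling layer in the spirit of WEMoE, assuming the "experts" are task vectors (differences between each fine-tuned MLP weight and the pretrained weight) and that a small linear router produces the merging coefficients from the input. The pooling, softmax, and single-matrix setup are illustrative simplifications rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class WeightEnsemblingMLP(nn.Module):
    """Merged linear layer whose weight is re-mixed per input from task vectors."""

    def __init__(self, w_pretrained: torch.Tensor, task_vectors: torch.Tensor):
        super().__init__()
        num_tasks, _, d_in = task_vectors.shape
        self.register_buffer("w_pre", w_pretrained)           # [d_out, d_in]
        self.register_buffer("task_vectors", task_vectors)    # [tasks, d_out, d_in]
        self.router = nn.Linear(d_in, num_tasks)               # emits merging coefficients

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [batch, d_in]
        # One coefficient per task, computed from mean-pooled input features.
        coeffs = torch.softmax(self.router(x.mean(dim=0)), dim=-1)            # [tasks]
        # Effective weight = pretrained weight + coefficient-weighted task vectors.
        w = self.w_pre + torch.tensordot(coeffs, self.task_vectors, dims=([0], [0]))
        return x @ w.t()
```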
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z)
- Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
arXiv Detail & Related papers (2024-06-14T07:16:18Z)
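A minimal sketch of the idea, under the assumption that the fused model's weights are a preference-conditioned combination of single-task expert weights produced by a small gating network; the gating architecture and softmax normalization are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PreferenceConditionedFusion(nn.Module):
    """Map a preference vector over objectives to fused model weights."""

    def __init__(self, expert_weights: torch.Tensor):
        super().__init__()
        num_objectives = expert_weights.shape[0]
        self.register_buffer("expert_weights", expert_weights)  # [objectives, ...]
        self.gate = nn.Sequential(
            nn.Linear(num_objectives, 32),
            nn.ReLU(),
            nn.Linear(32, num_objectives),
        )

    def fused_weights(self, preference: torch.Tensor) -> torch.Tensor:
        # preference: [objectives], e.g. (0.7, 0.3) for a two-objective trade-off.
        coeffs = torch.softmax(self.gate(preference), dim=-1)   # mixing coefficients
        # Weighted sum of expert weights over the leading (objective) dimension;
        # sweeping the preference traces an approximation of the Pareto front.
        return torch.tensordot(coeffs, self.expert_weights, dims=([0], [0]))
```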
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change of the router's l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
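Reading the criterion above literally, experts whose router l2 norm changed least between the pretrained and fine-tuned checkpoints are pruned first; the sketch below implements that reading, which is an interpretation of the summary rather than a verified reproduction of the paper's method.

```python
import torch

def experts_to_prune(router_w_pretrained: torch.Tensor,
                     router_w_finetuned: torch.Tensor,
                     num_to_prune: int) -> torch.Tensor:
    """Rank experts by how little their router l2 norm changed during fine-tuning.

    router_w_*: [num_experts, d] router weight rows before/after fine-tuning.
    Returns the indices of the `num_to_prune` experts to drop.  An alternative
    reading of the criterion is the l2 norm of the weight change itself; for
    that, use (router_w_finetuned - router_w_pretrained).norm(dim=-1).
    """
    norm_change = (router_w_finetuned.norm(dim=-1)
                   - router_w_pretrained.norm(dim=-1)).abs()    # [num_experts]
    return torch.argsort(norm_change)[:num_to_prune]
```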
- U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) models have been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large-scale inner-source dataset (160k hours).
arXiv Detail & Related papers (2024-04-25T08:34:21Z)
- Domain Generalization via Balancing Training Difficulty and Model Capability [61.053202176230904]
Domain generalization (DG) aims to learn domain-generalizable models from one or multiple source domains that can perform well in unseen target domains.
Despite its recent progress, most existing work suffers from the misalignment between the difficulty level of training samples and the capability of contemporarily trained models.
We design MoDify, a Momentum Difficulty framework that tackles the misalignment by balancing the seesaw between the model's capability and the samples' difficulties.
arXiv Detail & Related papers (2023-09-02T07:09:23Z)
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement [19.639936387834677]
Mixture-of-Experts (MoEs) are becoming more popular and have demonstrated impressive pretraining scalability in various downstream tasks.
MoEs are becoming a new data analytics paradigm in the data life cycle and suffer from unique challenges at unprecedented scales, complexities, and granularities.
In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently addresses the inefficiency caused by dynamic dataflow.
arXiv Detail & Related papers (2023-04-08T07:34:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.