AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for
Efficient Neural Machine Translation
- URL: http://arxiv.org/abs/2210.07535v2
- Date: Wed, 7 Jun 2023 22:41:40 GMT
- Title: AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for
Efficient Neural Machine Translation
- Authors: Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim,
Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah,
Sebastien Bubeck, Jianfeng Gao
- Abstract summary: Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoEs under computational constraints.
- Score: 104.0979785739202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in
Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a
homogeneous design where the same number of experts of the same size are placed
uniformly throughout the network. Furthermore, existing MoE works do not
consider computational constraints (e.g., FLOPs, latency) to guide their
design. To this end, we develop AutoMoE -- a framework for designing
heterogeneous MoEs under computational constraints. AutoMoE leverages Neural
Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with
4x inference speedup (CPU) and FLOPs reduction over manually designed
Transformers, with parity in BLEU score over dense Transformer and within 1
BLEU point of MoE SwitchTransformer, on aggregate over benchmark datasets for
NMT. Heterogeneous search space with dense and sparsely activated Transformer
modules (e.g., how many experts? where to place them? what should be their
sizes?) allows for adaptive compute -- where different amounts of computations
are used for different tokens in the input. Adaptivity comes naturally from
routing decisions which send tokens to experts of different sizes. AutoMoE
code, data, and trained models are available at https://aka.ms/AutoMoE.
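To make the adaptive-compute idea from the abstract concrete, below is a minimal, hypothetical sketch (PyTorch-style, not the authors' released code) of a heterogeneous MoE layer: experts are feed-forward blocks with different hidden sizes, and a top-1 router decides per token which expert, and therefore how much computation, that token receives. Names such as HeterogeneousMoELayer and expert_hidden_sizes are illustrative assumptions, not identifiers from the AutoMoE codebase.

```python
# Minimal sketch (illustrative, not the authors' implementation) of a
# heterogeneous MoE layer: experts are FFNs with *different* hidden sizes,
# and a top-1 router sends each token to one expert, so per-token compute
# varies with the routing decision.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousMoELayer(nn.Module):
    def __init__(self, d_model: int, expert_hidden_sizes: list[int]):
        super().__init__()
        # Experts may have different intermediate (FFN) dimensions,
        # e.g. [512, 1024, 2048] -- this is the "heterogeneous" part.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, h), nn.ReLU(), nn.Linear(h, d_model))
            for h in expert_hidden_sizes
        )
        self.router = nn.Linear(d_model, len(expert_hidden_sizes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to a stream of tokens
        tokens = x.reshape(-1, x.size(-1))
        probs = F.softmax(self.router(tokens), dim=-1)
        expert_idx = probs.argmax(dim=-1)  # top-1 routing per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Tokens routed to a smaller expert use fewer FLOPs,
                # which is where token-level adaptive compute comes from.
                out[mask] = expert(tokens[mask]) * probs[mask, i].unsqueeze(-1)
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = HeterogeneousMoELayer(d_model=256, expert_hidden_sizes=[512, 1024, 2048])
    y = layer(torch.randn(2, 10, 256))
    print(y.shape)  # torch.Size([2, 10, 256])
```

In AutoMoE, design choices such as how many experts to use, where to place MoE layers, and what size to give each expert are the dimensions of the NAS search space under FLOPs/latency constraints; this sketch only illustrates why routing tokens to experts of different sizes yields adaptive per-token computation.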
Related papers
- UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths [17.68867710994329]
UniMoD is a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned.
We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks.
arXiv Detail & Related papers (2025-02-10T13:52:52Z)
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM).
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE designed to enhance both the efficacy and efficiency of sparse MoE models.
XMoE can improve model performance while decreasing the computation load at MoE layers by over 50%.
arXiv Detail & Related papers (2024-02-27T08:18:02Z)
- Approximating Two-Layer Feedforward Networks for Efficient Transformers [15.793406740545024]
We present a general framework that unifies various methods to approximate two-layer NNs, including product-key memories (PKMs).
We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales.
This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs.
arXiv Detail & Related papers (2023-10-16T21:23:16Z)
- MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints.
MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model.
We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- Dynamic Multi-Branch Layers for On-Device Neural Machine Translation [53.637479651600586]
We propose to improve the performance of on-device neural machine translation (NMT) systems with dynamic multi-branch layers.
Specifically, we design a layer-wise dynamic multi-branch network with only one branch activated during training and inference.
At almost the same computational cost, our method achieves improvements of up to 1.7 BLEU points on the WMT14 English-German translation task and 1.8 BLEU points on the WMT20 Chinese-English translation task.
arXiv Detail & Related papers (2021-05-14T07:32:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.