Towards Being Parameter-Efficient: A Stratified Sparsely Activated
Transformer with Dynamic Capacity
- URL: http://arxiv.org/abs/2305.02176v2
- Date: Sun, 22 Oct 2023 21:09:23 GMT
- Title: Towards Being Parameter-Efficient: A Stratified Sparsely Activated
Transformer with Dynamic Capacity
- Authors: Haoran Xu, Maha Elbayad, Kenton Murray, Jean Maillard and Vedanuj
Goswami
- Abstract summary: Stratified Mixture of Experts (SMoE) models can assign dynamic capacity to different tokens.
We show SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.
- Score: 37.04254056062765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-experts (MoE) models that employ sparse activation have
demonstrated effectiveness in significantly increasing the number of parameters
while maintaining low computational requirements per token. However, recent
studies have established that MoE models are inherently parameter-inefficient
as the improvement in performance diminishes with an increasing number of
experts. We hypothesize this parameter inefficiency is a result of all experts
having equal capacity, which may not adequately meet the varying complexity
requirements of different tokens or tasks. In light of this, we propose
Stratified Mixture of Experts (SMoE) models, which feature a stratified
structure and can assign dynamic capacity to different tokens. We demonstrate
the effectiveness of SMoE on three multilingual machine translation benchmarks,
containing 4, 15, and 94 language pairs, respectively. We show that SMoE
outperforms multiple state-of-the-art MoE models with the same or fewer
parameters.
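The stratified, dynamic-capacity idea can be pictured with a toy routing layer. The sketch below is not the authors' implementation; the class name, layer sizes, and top-1 routing are hypothetical. It only illustrates how a learned router over experts of unequal hidden size ends up assigning more or less capacity to different tokens.

```python
# Hypothetical illustration, NOT the SMoE paper's code: experts with different
# hidden sizes give the router a way to spend more or less compute per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariableCapacityMoE(nn.Module):
    def __init__(self, d_model=512, expert_hidden_sizes=(256, 512, 1024, 2048)):
        super().__init__()
        self.router = nn.Linear(d_model, len(expert_hidden_sizes))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, h), nn.ReLU(), nn.Linear(h, d_model))
            for h in expert_hidden_sizes   # experts of increasing capacity
        ])

    def forward(self, x):                                  # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)         # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = top_idx == e                          # tokens sent to expert e
            if routed.any():
                # scale by the gate probability so the router receives gradients
                out[routed] = expert(x[routed]) * top_prob[routed].unsqueeze(-1)
        return out

tokens = torch.randn(10, 512)
print(VariableCapacityMoE()(tokens).shape)                 # torch.Size([10, 512])
```

A real system would also need a load-balancing loss and capacity limits, and the paper's stratified structure organizes the capacity choice differently, so treat the above purely as an intuition aid.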
Related papers
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [4.109351791494196]
We introduce the Dynamic Mixture of Experts (DynMoE) technique to enhance the efficiency of training and inference for Transformer-based foundational models.
DynMoE incorporates a novel gating method that enables each token to automatically determine the number of experts to activate.
Our results demonstrate the effectiveness of our approach to achieve competitive performance compared to GMoE for vision and language tasks, and MoE-LLaVA for vision-language tasks.
arXiv Detail & Related papers (2024-05-23T08:18:30Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE designed to enhance both the efficacy and efficiency of sparse MoE models.
XMoE enhances model performance and can decrease the computation load at MoE layers by over 50% without sacrificing performance (a minimal sketch of this style of threshold-based expert selection appears after the related-papers list below).
arXiv Detail & Related papers (2024-02-27T08:18:02Z)
- Mixture-of-Expert Conformer for Streaming Multilingual ASR [33.14594179710925]
We propose a streaming truly multilingual Conformer incorporating mixture-of-expert layers.
The proposed MoE layer offers efficient inference by activating a fixed number of parameters as the number of experts increases.
We evaluate the proposed model on a set of 12 languages, and achieve an average 11.9% relative improvement in WER over the baseline.
arXiv Detail & Related papers (2023-05-25T02:16:32Z)
- Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition [17.73449206184214]
This paper proposes a parameter-efficient conformer via sharing sparsely-gated experts.
Specifically, we use sparsely-gated mixture-of-experts (MoE) to extend the capacity of a conformer block without increasing computation.
arXiv Detail & Related papers (2022-09-17T13:22:19Z)
- Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
- Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
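Several of the related entries above, most directly DynMoE and XMoE, revolve around letting each token decide how many experts to activate rather than using a fixed top-k. The sketch below shows one generic, assumed way to do threshold-based selection; the function name, threshold value, and fallback rule are illustrative and do not reproduce either paper's gating.

```python
# Assumed illustration of threshold-style adaptive expert selection (not the
# DynMoE or XMoE implementation): each token activates every expert whose
# routing probability clears a threshold, so the expert count varies per token.
import torch
import torch.nn as nn

def adaptive_expert_selection(scores: torch.Tensor, threshold: float = 0.2):
    """scores: (num_tokens, num_experts) router probabilities."""
    mask = scores >= threshold                               # variable count per token
    # fall back to the single best expert if nothing clears the threshold
    fallback = torch.zeros_like(mask)
    fallback[torch.arange(scores.size(0)), scores.argmax(dim=-1)] = True
    mask = mask | fallback
    weights = scores * mask                                  # zero out unchosen experts
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalise the rest
    return mask, weights

router = nn.Linear(512, 8)
tokens = torch.randn(4, 512)
mask, weights = adaptive_expert_selection(router(tokens).softmax(dim=-1))
print(mask.sum(dim=-1))   # number of experts activated by each token (varies)
```

How the threshold is set or learned, and how expert load balancing is enforced, is exactly where the individual papers differ.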