Towards Being Parameter-Efficient: A Stratified Sparsely Activated
Transformer with Dynamic Capacity
- URL: http://arxiv.org/abs/2305.02176v2
- Date: Sun, 22 Oct 2023 21:09:23 GMT
- Title: Towards Being Parameter-Efficient: A Stratified Sparsely Activated
Transformer with Dynamic Capacity
- Authors: Haoran Xu, Maha Elbayad, Kenton Murray, Jean Maillard and Vedanuj
Goswami
- Abstract summary: Stratified Mixture of Experts (SMoE) models can assign dynamic capacity to different tokens.
We show SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.
- Score: 37.04254056062765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-experts (MoE) models that employ sparse activation have
demonstrated effectiveness in significantly increasing the number of parameters
while maintaining low computational requirements per token. However, recent
studies have established that MoE models are inherently parameter-inefficient
as the improvement in performance diminishes with an increasing number of
experts. We hypothesize this parameter inefficiency is a result of all experts
having equal capacity, which may not adequately meet the varying complexity
requirements of different tokens or tasks. In light of this, we propose
Stratified Mixture of Experts (SMoE) models, which feature a stratified
structure and can assign dynamic capacity to different tokens. We demonstrate
the effectiveness of SMoE on three multilingual machine translation benchmarks,
containing 4, 15, and 94 language pairs, respectively. We show that SMoE
outperforms multiple state-of-the-art MoE models with the same or fewer
parameters.
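The abstract describes SMoE only at a high level, so the following is a hedged, illustrative Python sketch of the general idea of dynamic per-token capacity, not the authors' implementation: a learned router picks one expert per token from experts whose feed-forward hidden sizes differ, so different tokens consume different amounts of compute. The class name, sizes, and top-1 routing rule are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCapacityMoE(nn.Module):
    """Illustrative sketch only (not the SMoE architecture): experts are
    feed-forward blocks with different hidden sizes, so the compute spent
    on a token depends on which expert the router assigns it to."""

    def __init__(self, d_model=512, hidden_sizes=(256, 512, 1024, 2048)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, h), nn.ReLU(), nn.Linear(h, d_model))
            for h in hidden_sizes
        )
        self.router = nn.Linear(d_model, len(hidden_sizes))

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        expert_idx = gate.argmax(dim=-1)       # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i             # tokens assigned to expert i
            if mask.any():
                # scale by the gate probability so routing stays differentiable
                out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(DynamicCapacityMoE()(tokens).shape)      # torch.Size([10, 512])
```

In this toy setup the per-token capacity comes entirely from the router's choice among unequal experts; the actual SMoE paper additionally organizes experts into strata, which this sketch does not model.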
Related papers
- HMoE: Heterogeneous Mixture of Experts for Language Modeling [45.65121689677227]
Traditionally, Mixture of Experts (MoE) models use homogeneous experts, each with identical capacity.
We propose a novel Heterogeneous Mixture of Experts (HMoE) where experts differ in size and thus possess diverse capacities.
HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks.
arXiv Detail & Related papers (2024-08-20T09:35:24Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [33.834215393960605]
We introduce the Dynamic Mixture of Experts (DynMoE) technique to enhance the efficiency of training and inference for Transformer-based foundational models.
DynMoE incorporates a novel gating method that enables each token to automatically determine the number of experts to activate (see the illustrative sketch after this list).
Our results demonstrate that our approach achieves performance competitive with GMoE on vision and language tasks and with MoE-LLaVA on vision-language tasks.
arXiv Detail & Related papers (2024-05-23T08:18:30Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other sparse MoE optimization methods, making it easy to integrate with other sparse MoE models for enhanced performance.
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE method designed to enhance both the efficacy and efficiency of sparse MoE models.
XMoE can enhance model performance while decreasing the computation load at MoE layers by over 50%.
arXiv Detail & Related papers (2024-02-27T08:18:02Z)
- Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [5.536630285985836]
We introduce parameter-efficient sparsity crafting (PESC).
PESC crafts dense models into sparse models using the mixture-of-experts (MoE) architecture.
Our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GPT-3.5.
arXiv Detail & Related papers (2024-01-05T09:58:09Z)
- Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency when increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
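The Dynamic Mixture of Experts (DynMoE) entry above describes gating in which each token determines how many experts to activate. As a minimal, hypothetical sketch of that general idea (an assumed fixed-threshold rule, not the paper's actual gating mechanism), the per-token expert count can be made variable as follows; the class name, threshold, and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThresholdGate(nn.Module):
    """Hypothetical sketch (not the DynMoE gate): a token activates every
    expert whose score clears a fixed threshold, so the number of active
    experts varies per token instead of being a fixed top-k."""

    def __init__(self, d_model=512, num_experts=8, threshold=0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, num_experts)
        self.threshold = threshold

    def forward(self, x):                        # x: (num_tokens, d_model)
        scores = torch.sigmoid(self.scorer(x))   # independent per-expert scores
        active = scores > self.threshold         # (num_tokens, num_experts) bool
        # guarantee each token activates at least its best-scoring expert
        top1 = F.one_hot(scores.argmax(dim=-1), scores.size(-1)).bool()
        return active | top1, scores

gate = ThresholdGate()
active, scores = gate(torch.randn(4, 512))
print(active.sum(dim=-1))  # experts activated per token (can differ across tokens)
```

A standard top-k gate would return exactly k experts for every token; the only point of this sketch is that the per-token count can vary.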
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.