A Mixture of $h-1$ Heads is Better than $h$ Heads
- URL: http://arxiv.org/abs/2005.06537v1
- Date: Wed, 13 May 2020 19:05:58 GMT
- Title: A Mixture of $h-1$ Heads is Better than $h$ Heads
- Authors: Hao Peng, Roy Schwartz, Dianqi Li, and Noah A. Smith
- Abstract summary: We propose the mixture of attentive experts model (MAE).
Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks.
Our analysis shows that our model learns to specialize different experts to different inputs.
- Score: 63.12336930345417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-head attentive neural architectures have achieved state-of-the-art
results on a variety of natural language processing tasks. Evidence has shown
that they are overparameterized; attention heads can be pruned without
significant performance loss. In this work, we instead "reallocate" them -- the
model learns to activate different heads on different inputs. Drawing
connections between multi-head attention and mixture of experts, we propose the
mixture of attentive experts model (MAE). MAE is trained using a block
coordinate descent algorithm that alternates between updating (1) the
responsibilities of the experts and (2) their parameters. Experiments on
machine translation and language modeling show that MAE outperforms strong
baselines on both tasks. Particularly, on the WMT14 English to German
translation dataset, MAE improves over "transformer-base" by 0.8 BLEU, with a
comparable number of parameters. Our analysis shows that our model learns to
specialize different experts to different inputs.
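As a rough illustration of the block coordinate descent training described above, the sketch below alternates between (1) updating a gating network that assigns responsibilities over experts and (2) updating the experts' parameters with the responsibilities held fixed. The `experts`, `gate`, and `compute_loss` objects are illustrative stand-ins (with `compute_loss` assumed to return a per-example loss); in the paper the experts correspond to subsets of attention heads within one model rather than separate modules.

```python
# Minimal sketch (not the authors' code) of the alternating training described above:
# step (1) updates the gate that assigns responsibilities over experts, step (2)
# updates the experts' own parameters with the responsibilities held fixed.
import torch

def mae_training_step(batch, experts, gate, opt_gate, opt_experts, compute_loss):
    x, y = batch

    # (1) Update responsibilities: train the gate while the experts stay frozen.
    opt_gate.zero_grad()
    resp = torch.softmax(gate(x), dim=-1)            # [batch, n_experts]
    with torch.no_grad():                            # per-example loss of each expert
        losses = torch.stack([compute_loss(e(x), y) for e in experts], dim=-1)
    (resp * losses).sum(dim=-1).mean().backward()    # expected loss under the gate
    opt_gate.step()

    # (2) Update the experts' parameters with the responsibilities held fixed.
    opt_experts.zero_grad()
    with torch.no_grad():
        resp = torch.softmax(gate(x), dim=-1)
    losses = torch.stack([compute_loss(e(x), y) for e in experts], dim=-1)
    (resp * losses).sum(dim=-1).mean().backward()
    opt_experts.step()
```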
Related papers
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
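One plausible reading of "grouping and pruning similar experts" is sketched below: experts whose flattened weights are nearly collinear are treated as a group and only one representative is kept. The cosine-similarity criterion and the threshold are assumptions made for illustration, not the paper's actual grouping rule.

```python
# Hypothetical grouping-and-pruning of similar experts: keep one representative per
# group of experts whose flattened weights are highly similar. The cosine-similarity
# criterion and threshold are illustrative assumptions.
import torch
import torch.nn.functional as F

def prune_similar_experts(experts, threshold=0.9):
    flat = [torch.cat([p.detach().flatten() for p in e.parameters()]) for e in experts]
    keep = []                                   # indices of representative experts
    for i, w in enumerate(flat):
        similar = any(F.cosine_similarity(w, flat[j], dim=0) > threshold for j in keep)
        if not similar:
            keep.append(i)
    return [experts[i] for i in keep]           # pruned expert list
```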
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [33.834215393960605]
We introduce the Dynamic Mixture of Experts (DynMoE) technique to enhance the efficiency of training and inference for Transformer-based foundational models.
DynMoE incorporates a novel gating method that enables each token to automatically determine the number of experts to activate.
Our results demonstrate the effectiveness of our approach in achieving performance competitive with GMoE on vision and language tasks, and with MoE-LLaVA on vision-language tasks.
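A gate in which each token decides how many experts to activate could look roughly like the sketch below; the sigmoid scores, fixed threshold, and fallback rule are illustrative assumptions rather than DynMoE's exact gating function.

```python
# Rough sketch of a gate where each token decides how many experts it activates.
# Sigmoid scores, a fixed threshold, and the fallback rule are illustrative
# assumptions, not DynMoE's exact gating function.
import torch
import torch.nn.functional as F

def dynamic_gate(token_repr, expert_keys, threshold=0.5):
    # token_repr: [num_tokens, d]; expert_keys: [num_experts, d]
    scores = torch.sigmoid(token_repr @ expert_keys.t())       # [num_tokens, num_experts]
    mask = scores > threshold                                   # variable count per token
    # Guarantee at least one expert per token by falling back to the best-scoring one.
    fallback = F.one_hot(scores.argmax(dim=1), scores.shape[1]).bool()
    mask = mask | fallback
    weights = torch.where(mask, scores, torch.zeros_like(scores))
    return mask, weights / weights.sum(dim=1, keepdim=True)     # routing mask and weights
```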
arXiv Detail & Related papers (2024-05-23T08:18:30Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
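The sub-token idea can be sketched as a simple reshaping: each token is split into several sub-tokens, every sub-token is routed through the MoE layer independently, and the sub-tokens are merged back. `moe_layer` is a stand-in for any sparse MoE layer; any head projection or merging layers the paper may use are omitted here.

```python
# Illustrative reshaping sketch of the sub-token idea: split each token into
# `num_heads` sub-tokens, route every sub-token through the (stand-in) MoE layer,
# then merge the sub-tokens back into full tokens.
import torch

def multi_head_moe(x, moe_layer, num_heads):
    # x: [batch, seq, d_model]; d_model must be divisible by num_heads.
    b, s, d = x.shape
    sub = x.reshape(b, s * num_heads, d // num_heads)   # token -> num_heads sub-tokens
    sub = moe_layer(sub)                                 # experts operate on sub-tokens
    return sub.reshape(b, s, d)                          # merge sub-tokens back
```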
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
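Confidence-based selection could look roughly like the following: experts are added in order of routing probability until the cumulative probability passes a threshold, so less confident (harder) inputs activate more experts. The threshold value and exact stopping rule are assumptions for illustration.

```python
# Sketch of confidence-based routing: add experts in order of routing probability
# until the cumulative probability passes a threshold, so less confident (harder)
# inputs activate more experts. The threshold value is an illustrative assumption.
import torch

def route_by_confidence(router_logits, p_threshold=0.5):
    probs = torch.softmax(router_logits, dim=-1)                 # [num_tokens, num_experts]
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    prefix = sorted_probs.cumsum(dim=-1) - sorted_probs          # cumulative prob. of earlier experts
    selected = prefix < p_threshold                              # keep adding until threshold is passed
    mask = torch.zeros_like(probs).scatter(-1, sorted_idx, selected.float()).bool()
    return mask                                                  # True where an expert is activated
```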
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by the other experts.
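A simplified sketch of this orthogonality idea: the gradient of the expert being updated is projected onto the orthogonal complement of the subspace spanned by the other experts' flattened weights. Using flattened weights as the subspace and projecting every step are assumptions, not the paper's exact optimizer.

```python
# Simplified sketch of the orthogonal-update idea: project the updating expert's
# gradient onto the orthogonal complement of the subspace spanned by the other
# experts' flattened weights. Using flattened weights as the subspace is an assumption.
import torch

def orthogonalize_update(grad, other_expert_weights):
    # grad: [d] flattened gradient of the expert being updated.
    # other_expert_weights: [k, d] flattened weights of the remaining experts (k < d).
    q, _ = torch.linalg.qr(other_expert_weights.t())   # orthonormal basis of their span, [d, k]
    in_span = q @ (q.t() @ grad)                        # component that lies inside the span
    return grad - in_span                               # direction orthogonal to the other experts
```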
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
- Mixture-of-Experts with Expert Choice Routing [44.777850078713634]
Prior work allocates a fixed number of experts to each token using a top-k function.
We propose a heterogeneous mixture-of-experts employing an expert choice method.
Our method improves training convergence time by more than 2x.
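In expert choice routing, each expert selects the tokens it scores highest rather than each token selecting a fixed top-k of experts; a minimal sketch follows, with the capacity treated as a given hyperparameter.

```python
# Minimal sketch of expert-choice routing: each expert picks the `capacity` tokens it
# scores highest, instead of each token picking a fixed top-k of experts, so the number
# of experts per token varies and the load per expert is balanced by construction.
import torch

def expert_choice_routing(router_logits, capacity):
    # router_logits: [num_tokens, num_experts]
    scores = torch.softmax(router_logits, dim=-1)              # routing probabilities per token
    gate_weights, chosen_tokens = scores.topk(capacity, dim=0) # each expert takes its top tokens
    return chosen_tokens, gate_weights                         # both [capacity, num_experts]
```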
arXiv Detail & Related papers (2022-02-18T17:46:11Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
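A minimal sketch of stochastic expert activation follows, assuming a uniform random choice of expert per forward pass (per batch here, for simplicity); the model's training objective and any regularization are not reproduced, and the expert modules are stand-ins.

```python
# Minimal sketch of stochastic expert activation: an expert is chosen uniformly at
# random for each forward pass (per batch here, for simplicity), with no learned gate;
# the expert modules are stand-ins.
import torch

def thor_forward(x, experts):
    idx = torch.randint(len(experts), (1,)).item()   # uniform random expert
    return experts[idx](x)
```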
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
- Cascaded Head-colliding Attention [28.293881246428377]
Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks.
We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution.
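The sketch below is a deliberately simplified, non-variational reading of the cascade idea: heads are computed in sequence and each head is conditioned on the previous head's output. It does not reproduce the paper's hierarchical variational distribution, and `head_modules` and `mix_layers` are hypothetical stand-ins.

```python
# Deliberately simplified, non-variational sketch of cascading heads: heads are
# computed in sequence and each head sees the previous head's contribution. The
# `head_modules` and `mix_layers` are hypothetical stand-in nn.Modules.
import torch

def cascaded_heads(x, head_modules, mix_layers):
    # x: [batch, seq, d_model]; head_modules[i]: d_model -> d_head; mix_layers[i]: d_head -> d_model.
    outputs, carry = [], x
    for head, mix in zip(head_modules, mix_layers):
        h = head(carry)                    # this head attends to the conditioned input
        outputs.append(h)
        carry = x + mix(h)                 # condition the next head on this head's output
    return torch.cat(outputs, dim=-1)      # concatenate head outputs as in multi-head attention
```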
arXiv Detail & Related papers (2021-05-31T10:06:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.