Mixture-of-Experts with Expert Choice Routing
- URL: http://arxiv.org/abs/2202.09368v1
- Date: Fri, 18 Feb 2022 17:46:11 GMT
- Title: Mixture-of-Experts with Expert Choice Routing
- Authors: Yanqi Zhou and Tao Lei and Hanxiao Liu and Nan Du and Yanping Huang
and Vincent Zhao and Andrew Dai and Zhifeng Chen and Quoc Le and James Laudon
- Abstract summary: Prior work allocates a fixed number of experts to each token using a top-k function.
We propose a heterogeneous mixture-of-experts employing an expert choice method.
Our method improves training convergence time by more than 2x.
- Score: 44.777850078713634
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sparsely activated Mixture-of-Experts (MoE) models allow the number of
parameters to greatly increase while keeping the amount of computation for a
given token or a given sample unchanged. However, a poor expert routing
strategy (e.g. one resulting in load imbalance) can cause certain experts to be
under-trained, leaving an expert under- or over-specialized. Prior work
allocates a fixed number of experts to each token using a top-k function,
regardless of the relative importance of different tokens. To address this, we
propose a heterogeneous mixture-of-experts employing an expert choice method.
Instead of letting tokens select the top-k experts, we have experts selecting
the top-k tokens. As a result, each token can be routed to a variable number of
experts and each expert can have a fixed bucket size. We systematically study
pre-training speedups, using the same computational resources as the Switch
Transformer top-1 and GShard top-2 gating of prior work, and find that our
method improves training convergence time by more than 2x. For the same
computational cost, our method demonstrates higher performance when fine-tuning
on 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller
activation cost, our method outperforms the T5 dense model in 7 out of the 11
tasks.
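The routing step described in the abstract is compact enough to sketch in code. Below is a minimal JAX sketch of expert-choice gating under stated assumptions: a learned gating matrix w_gate, softmax normalization over the expert dimension, and a capacity_factor hyperparameter that sets the per-expert bucket size. The names, shapes, and defaults are illustrative assumptions, not the paper's released implementation.

```python
# Sketch of expert-choice routing (experts pick their top-k tokens), assuming
# softmax gating and a capacity_factor hyperparameter; illustrative only.
import jax

def expert_choice_route(x, w_gate, capacity_factor=2.0):
    """x: [n_tokens, d_model] token activations.
    w_gate: [d_model, n_experts] gating weights (assumed learned).
    Returns per-expert token indices and combine weights."""
    n_tokens, n_experts = x.shape[0], w_gate.shape[1]
    # Token-to-expert affinity, normalized over the expert dimension.
    scores = jax.nn.softmax(x @ w_gate, axis=-1)   # [n_tokens, n_experts]
    # Fixed bucket size per expert; a token may be picked by 0, 1, or many experts.
    k = int(capacity_factor * n_tokens / n_experts)
    # Each expert selects its top-k tokens (top-k along the token axis),
    # unlike token-choice gating where each token selects its top-k experts.
    gates, token_idx = jax.lax.top_k(scores.T, k)  # both [n_experts, k]
    return token_idx, gates

# Minimal usage with random data: 16 tokens, d_model=8, 4 experts.
kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (16, 8))
w_gate = jax.random.normal(kw, (8, 4))
idx, gates = expert_choice_route(x, w_gate)
print(idx.shape, gates.shape)  # (4, 8) (4, 8): each of 4 experts takes 8 tokens
```

Because the top-k runs over the token axis, load balance holds by construction: every expert processes exactly k tokens, while the number of experts assigned to a given token varies with its gating scores.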
Related papers
- Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate.
We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network(FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
- Mixture of Diverse Size Experts [13.29015039603752]
The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs.
We propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes.
arXiv Detail & Related papers (2024-09-18T08:23:27Z)
- Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts [44.09546603624385]
We introduce a notion of expert specialization for Soft MoE.
We show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset.
arXiv Detail & Related papers (2024-09-02T00:39:00Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- Active Ranking of Experts Based on their Performances in Many Tasks [72.96112117037465]
We consider the problem of ranking n experts based on their performances on d tasks.
We make a monotonicity assumption stating that for each pair of experts, one outperforms the other on all tasks.
arXiv Detail & Related papers (2023-06-05T06:55:39Z)
- Entry Dependent Expert Selection in Distributed Gaussian Processes Using Multilabel Classification [12.622412402489951]
An ensemble technique combines local predictions from Gaussian experts trained on different partitions of the data.
This paper proposes a flexible expert selection approach based on the characteristics of entry data points.
arXiv Detail & Related papers (2022-11-17T23:23:26Z)
- A Mixture of $h-1$ Heads is Better than $h$ Heads [63.12336930345417]
We propose the mixture of attentive experts model (MAE).
Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks.
Our analysis shows that our model learns to specialize different experts to different inputs.
arXiv Detail & Related papers (2020-05-13T19:05:58Z)