Mixture-of-Experts with Expert Choice Routing
- URL: http://arxiv.org/abs/2202.09368v1
- Date: Fri, 18 Feb 2022 17:46:11 GMT
- Title: Mixture-of-Experts with Expert Choice Routing
- Authors: Yanqi Zhou and Tao Lei and Hanxiao Liu and Nan Du and Yanping Huang
and Vincent Zhao and Andrew Dai and Zhifeng Chen and Quoc Le and James Laudon
- Abstract summary: Prior work allocates a fixed number of experts to each token using a top-k function.
We propose a heterogeneous mixture-of-experts employing an expert choice method.
Our method improves training convergence time by more than 2x.
- Score: 44.777850078713634
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sparsely-activated Mixture-of-experts (MoE) models allow the number of
parameters to greatly increase while keeping the amount of computation for a
given token or a given sample unchanged. However, a poor expert routing
strategy (e.g. one resulting in load imbalance) can cause certain experts to be
under-trained, leading to an expert being under- or over-specialized. Prior work
allocates a fixed number of experts to each token using a top-k function
regardless of the relative importance of different tokens. To address this, we
propose a heterogeneous mixture-of-experts employing an expert choice method.
Instead of letting tokens select the top-k experts, we have experts selecting
the top-k tokens. As a result, each token can be routed to a variable number of
experts and each expert can have a fixed bucket size. We systematically study
pre-training speedups using the same computational resources as the Switch
Transformer top-1 and GShard top-2 gating of prior work, and find that our
method improves training convergence time by more than 2x. For the same
computational cost, our method demonstrates higher performance in fine-tuning
11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller
activation cost, our method outperforms the T5 dense model in 7 out of the 11
tasks.
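The routing mechanism described in the abstract is simple to sketch: compute token-to-expert affinity scores, then let each expert keep its top-k scoring tokens, with k fixed by a capacity factor so every expert has the same bucket size. The snippet below is a minimal NumPy sketch under assumed shapes and names (token_states, expert_embed, capacity_factor are illustrative and not taken from the paper's code).

```python
import numpy as np

def expert_choice_routing(token_states, expert_embed, capacity_factor=2.0):
    """Each expert selects its top-k tokens instead of each token selecting experts.

    token_states: (n_tokens, d_model) token representations (assumed shape).
    expert_embed: (d_model, n_experts) routing projection (assumed shape).
    Returns per-expert token indices and their gating weights.
    """
    n_tokens = token_states.shape[0]
    n_experts = expert_embed.shape[1]
    # Fixed per-expert bucket size: load balance holds by construction.
    k = int(capacity_factor * n_tokens / n_experts)

    # Token-to-expert affinity, softmax-normalized over experts for each token.
    logits = token_states @ expert_embed                    # (n_tokens, n_experts)
    logits = logits - logits.max(axis=-1, keepdims=True)    # numerical stability
    scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    # Each expert keeps the k tokens with the highest affinity to it.
    top_tokens = np.argsort(-scores, axis=0)[:k]            # (k, n_experts) token indices
    gates = np.take_along_axis(scores, top_tokens, axis=0)  # (k, n_experts) gating weights
    return top_tokens, gates

# Example: 8 tokens, 4 experts, model width 16 (illustrative sizes).
rng = np.random.default_rng(0)
idx, w = expert_choice_routing(rng.normal(size=(8, 16)), rng.normal(size=(16, 4)))
```

Because each expert keeps exactly k tokens, no expert is overloaded, while a given token may be picked up by zero, one, or several experts depending on its scores.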
Related papers
- Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model [10.682263930467196]
The Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs)
Existing MoE methods in LVLMs encourage different experts to handle different tokens, and thus they employ a router to predict the routing for each token.
This paper proposes a novel method based on token-level gradient analysis.
arXiv Detail & Related papers (2024-06-28T13:20:17Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
- Divide and not forget: Ensemble of selectively trained experts in Continual Learning [0.2886273197127056]
Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know.
A trend in this area is to use a mixture-of-expert technique, where different models work together to solve the task.
Their SEED method selects only one expert, the one best suited to the considered task, and uses data from this task to fine-tune only this expert.
arXiv Detail & Related papers (2024-01-18T18:25:29Z)
- Active Ranking of Experts Based on their Performances in Many Tasks [72.96112117037465]
We consider the problem of ranking n experts based on their performances on d tasks.
We make a monotonicity assumption stating that for each pair of experts, one outperforms the other on all tasks.
arXiv Detail & Related papers (2023-06-05T06:55:39Z)
- Entry Dependent Expert Selection in Distributed Gaussian Processes Using Multilabel Classification [12.622412402489951]
An ensemble technique combines local predictions from Gaussian experts trained on different partitions of the data.
This paper proposes a flexible expert selection approach based on the characteristics of entry data points.
arXiv Detail & Related papers (2022-11-17T23:23:26Z)
- BASE Layers: Simplifying Training of Large, Sparse Models [53.98145464002843]
We introduce a new balanced assignment of experts (BASE) layer for large language models.
Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules.
We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
arXiv Detail & Related papers (2021-03-30T23:08:32Z)
- A Mixture of $h-1$ Heads is Better than $h$ Heads [63.12336930345417]
We propose the mixture of attentive experts model (MAE).
Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks.
Our analysis shows that our model learns to specialize different experts to different inputs.
arXiv Detail & Related papers (2020-05-13T19:05:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.