Routers in Vision Mixture of Experts: An Empirical Study
- URL: http://arxiv.org/abs/2401.15969v2
- Date: Thu, 18 Apr 2024 18:48:31 GMT
- Title: Routers in Vision Mixture of Experts: An Empirical Study
- Authors: Tianlin Liu, Mathieu Blondel, Carlos Riquelme, Joan Puigcerver
- Abstract summary: Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost.
A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens).
- Score: 26.51711534240885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models.
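To make the routing variants in the abstract concrete, here is a minimal sketch of Token Choice, Expert Choice, and Soft MoE routing. It is written in PyTorch and is not the paper's implementation; the tensor shapes, top-k values, and the tanh stand-in for experts are illustrative assumptions.

```python
# Minimal sketch of the three routing families discussed in the abstract
# (Token Choice, Expert Choice, Soft MoE). Shapes and the tiny stand-in
# experts are illustrative choices, not the paper's actual implementation.
import torch
import torch.nn.functional as F

n_tokens, d_model, n_experts, k = 16, 32, 4, 2
x = torch.randn(n_tokens, d_model)                   # token embeddings
router = torch.randn(d_model, n_experts)             # routing parameters
logits = x @ router                                  # (n_tokens, n_experts)

# --- Sparse MoE, Token Choice: each token picks its top-k experts.
tc_weights, tc_experts = F.softmax(logits, dim=-1).topk(k, dim=-1)
# tc_experts[i] lists the k experts that process token i.

# --- Sparse MoE, Expert Choice: each expert picks the tokens it prefers.
capacity = (n_tokens * k) // n_experts               # tokens per expert
ec_weights, ec_tokens = F.softmax(logits, dim=0).topk(capacity, dim=0)
# ec_tokens[:, e] lists the `capacity` tokens processed by expert e.

# --- Soft MoE: each expert slot processes a weighted average of all tokens,
# so the assignment is soft rather than binary.
dispatch = F.softmax(logits, dim=0)                  # mix tokens per expert slot
combine = F.softmax(logits, dim=-1)                  # mix slot outputs per token
slot_inputs = dispatch.T @ x                         # (n_experts, d_model)
slot_outputs = torch.stack(
    [torch.tanh(slot_inputs[e]) for e in range(n_experts)]  # stand-in "experts"
)
y_soft = combine @ slot_outputs                      # (n_tokens, d_model)
```

The same routing logits feed all three variants; only the normalization axis and the hard-vs-soft assignment differ, which is the sense in which the paper's unified formulation subsumes them.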
Related papers
- ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts [71.11994027685974]
We integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision.
We observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design.
To address this, we introduce a shared expert to learn and capture common information, serving as an effective way to construct stable ViMoE.
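As a rough illustration of the shared-expert idea described above, the sketch below adds an always-active expert alongside routed experts. The layer sizes and the top-1 routing are assumptions for illustration, not ViMoE's actual configuration.

```python
# Hypothetical sketch of an MoE layer with a shared (always-active) expert;
# layer sizes and top-1 routing are assumptions, not ViMoE's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model=32, d_hidden=64, n_experts=4):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.shared_expert = make_ffn()               # processes every token
        self.experts = nn.ModuleList([make_ffn() for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                             # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)      # (n_tokens, n_experts)
        weight, idx = gate.max(dim=-1)                # top-1 routed expert per token
        routed = torch.stack(
            [self.experts[i](t) for i, t in zip(idx.tolist(), x)]
        )
        # Shared expert captures common information; routed experts specialize.
        return self.shared_expert(x) + weight.unsqueeze(-1) * routed

x = torch.randn(8, 32)
print(SharedExpertMoE()(x).shape)   # torch.Size([8, 32])
```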
arXiv Detail & Related papers (2024-10-21T07:51:17Z) - DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models [1.4255659581428335]
We propose DA-MoE, a novel approach that dynamically allocates a variable number of experts for Mixture-of-Experts models based on an effective token importance measure.
Our approach consistently outperforms the state-of-the-art Transformer-based MoE model on the popular GLUE benchmark.
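The toy sketch below illustrates the general idea of giving each token a variable number of experts based on an importance score; the norm-based importance proxy and the 1-to-k_max mapping are hypothetical stand-ins, not DA-MoE's actual measure.

```python
# Toy illustration of allocating a variable number of experts per token from
# an importance score; the norm-based proxy is a hypothetical stand-in.
import torch
import torch.nn.functional as F

n_tokens, d_model, n_experts, k_max = 8, 32, 4, 3
x = torch.randn(n_tokens, d_model)
logits = x @ torch.randn(d_model, n_experts)

importance = x.norm(dim=-1)                                   # hypothetical proxy
importance = importance / importance.max()
k_per_token = (1 + importance * (k_max - 1)).round().long()   # 1..k_max experts

gates = F.softmax(logits, dim=-1)
for i in range(n_tokens):
    k_i = int(k_per_token[i])
    w, idx = gates[i].topk(k_i)
    print(f"token {i}: {k_i} expert(s) -> {idx.tolist()}")
```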
arXiv Detail & Related papers (2024-09-10T17:36:15Z) - MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training.
Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
arXiv Detail & Related papers (2024-07-13T09:22:33Z) - A Closer Look into Mixture-of-Experts in Large Language Models [26.503570706063634]
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance.
The MoE architecture can increase the model size without sacrificing computational efficiency.
We make an initial attempt to understand the inner workings of MoE-based large language models.
arXiv Detail & Related papers (2024-06-26T10:07:57Z) - A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change in the router's l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
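A minimal sketch of the stated criterion: rank experts by the l2 norm of the change in their routing weights between the pretrained and fine-tuned models, and prune those that changed least. The shapes and the number of experts kept are illustrative.

```python
# Sketch of the described pruning criterion: drop the experts whose router
# weights changed least (smallest l2 norm of change) during fine-tuning.
# Shapes and the number of experts kept are illustrative assumptions.
import torch

d_model, n_experts, n_keep = 32, 8, 4
router_pretrained = torch.randn(d_model, n_experts)
router_finetuned = router_pretrained + 0.1 * torch.randn(d_model, n_experts)

# l2 norm of the per-expert change in routing weights.
change = (router_finetuned - router_pretrained).norm(dim=0)   # (n_experts,)

# Keep the experts whose router columns moved the most; prune the rest.
keep = change.topk(n_keep).indices.sort().values
print("experts kept:", keep.tolist())

pruned_router = router_finetuned[:, keep]                     # (d_model, n_keep)
```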
arXiv Detail & Related papers (2024-05-26T17:52:58Z) - Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient
for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that patch-level MoE (pMoE) provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z) - AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for
Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z) - Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates heads' utilities, providing a new perspective to discuss the model's interpretability.
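The sketch below conveys the general idea of routing each token to a subset of attention heads and mixing their outputs with router weights; the dimensions, top-k value, and plain scaled-dot-product heads are illustrative assumptions rather than MoA's exact design.

```python
# Rough sketch of a mixture-of-attention-heads idea: a router picks top-k heads
# per query token and mixes their outputs. Dimensions and the plain
# scaled-dot-product heads are illustrative, not MoA's exact design.
import torch
import torch.nn.functional as F

n_tokens, d_model, n_heads, d_head, k = 8, 32, 4, 16, 2
x = torch.randn(n_tokens, d_model)

Wq = torch.randn(n_heads, d_model, d_head)
Wk = torch.randn(n_heads, d_model, d_head)
Wv = torch.randn(n_heads, d_model, d_head)
Wo = torch.randn(n_heads, d_head, d_model)
router = torch.randn(d_model, n_heads)

# Per-head self-attention outputs, projected back to d_model.
q = torch.einsum('td,hde->hte', x, Wq)
kk = torch.einsum('td,hde->hte', x, Wk)
v = torch.einsum('td,hde->hte', x, Wv)
attn = F.softmax(q @ kk.transpose(-1, -2) / d_head ** 0.5, dim=-1)
head_out = torch.einsum('hte,hed->htd', attn @ v, Wo)   # (n_heads, n_tokens, d_model)

# Router selects top-k heads per token and mixes their outputs.
gates = F.softmax(x @ router, dim=-1)                   # (n_tokens, n_heads)
weights, heads = gates.topk(k, dim=-1)                  # (n_tokens, k)
selected = head_out[heads, torch.arange(n_tokens).unsqueeze(-1)]  # (n_tokens, k, d_model)
y = (weights.unsqueeze(-1) * selected).sum(dim=1)       # (n_tokens, d_model)
```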
arXiv Detail & Related papers (2022-10-11T04:54:05Z) - Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
However, MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z) - StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z) - SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture
of Experts [29.582683923988203]
Mixture of Experts (MoE) based Transformers have shown promising results in many domains.
In this work, we explore the MoE based model for speech recognition, named SpeechMoE.
A new router architecture is used in SpeechMoE, which can simultaneously utilize information from a shared embedding network.
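As a hedged illustration, the sketch below feeds a router both the layer's hidden state and the output of a shared embedding network; the concatenation scheme and layer sizes are assumptions based on the summary, not SpeechMoE's exact architecture.

```python
# Hedged sketch of a router that combines a layer's hidden state with the
# output of a shared embedding network, as the summary describes.
# The concatenation scheme and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_embed, n_experts, n_frames = 32, 16, 4, 10

shared_embedding_net = nn.Linear(d_model, d_embed)      # shared across MoE layers
router = nn.Linear(d_model + d_embed, n_experts)        # sees hidden state + embedding

features = torch.randn(n_frames, d_model)               # acoustic frame features
hidden = torch.tanh(features)                           # stand-in for a layer's output

shared = shared_embedding_net(features)                 # (n_frames, d_embed)
gates = F.softmax(router(torch.cat([hidden, shared], dim=-1)), dim=-1)
top1 = gates.argmax(dim=-1)                             # expert chosen per frame
print(top1.tolist())
```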
arXiv Detail & Related papers (2021-05-07T02:38:23Z)