On the Representation Collapse of Sparse Mixture of Experts
- URL: http://arxiv.org/abs/2204.09179v1
- Date: Wed, 20 Apr 2022 01:40:19 GMT
- Title: On the Representation Collapse of Sparse Mixture of Experts
- Authors: Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra,
Saksham Singhal, Payal Bajaj, Xia Song, Furu Wei
- Abstract summary: Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
- Score: 102.83396489230375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse mixture of experts provides larger model capacity while requiring a
constant computational overhead. It employs the routing mechanism to distribute
input tokens to the best-matched experts according to their hidden
representations. However, learning such a routing mechanism encourages token
clustering around expert centroids, implying a trend toward representation
collapse. In this work, we propose to estimate the routing scores between
tokens and experts on a low-dimensional hypersphere. We conduct extensive
experiments on cross-lingual language model pre-training and fine-tuning on
downstream tasks. Experimental results across seven multilingual benchmarks
show that our method achieves consistent gains. We also present a comprehensive
analysis on the representation and routing behaviors of our models. Our method
alleviates the representation collapse issue and achieves more consistent
routing than the baseline mixture-of-experts methods.
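One way to read "routing scores between tokens and experts on a low-dimensional hypersphere" is sketched below. It is a minimal illustration, not the authors' implementation: it assumes the token's hidden state is linearly projected to a small dimension, that both the projected token and each expert embedding are L2-normalized, and that the score is their dot product divided by a temperature. The names `project`, `routing_scores`, `route_top1`, `weight`, `expert_embeddings`, and `temperature`, as well as the dimensions and the top-1 selection, are hypothetical choices made for this example.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm so that it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project(hidden, weight):
    """Linear projection; `weight` has shape (low_dim, hidden_dim). Assumed, not from the source."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def routing_scores(hidden, weight, expert_embeddings, temperature=0.1):
    """Cosine similarity between the projected token and each expert, divided by a temperature."""
    token = l2_normalize(project(hidden, weight))
    scores = []
    for emb in expert_embeddings:
        expert = l2_normalize(emb)
        scores.append(sum(t * e for t, e in zip(token, expert)) / temperature)
    return scores


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(hidden, weight, expert_embeddings, temperature=0.1):
    """Return the index of the best-matched expert and its gate probability."""
    probs = softmax(routing_scores(hidden, weight, expert_embeddings, temperature))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Toy setup with illustrative sizes: 16-dim hidden states, 4-dim routing space, 3 experts.
rng = random.Random(0)
hidden_dim, low_dim, num_experts = 16, 4, 3
weight = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(low_dim)]
experts = [[rng.gauss(0, 1) for _ in range(low_dim)] for _ in range(num_experts)]
token_hidden = [rng.gauss(0, 1) for _ in range(hidden_dim)]

expert_id, gate = route_top1(token_hidden, weight, experts)
```

Because both vectors are normalized in this sketch, each score depends only on the angle between the token and the expert, so rescaling a hidden state by a positive constant leaves the routing decision unchanged, and every score is bounded in magnitude by `1 / temperature`.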
Related papers
- READ: Improving Relation Extraction from an ADversarial Perspective [33.44949503459933]
We propose an adversarial training method specifically designed for relation extraction (RE).
Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations.
arXiv Detail & Related papers (2024-04-02T16:42:44Z) - Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number, or even just one expert, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
arXiv Detail & Related papers (2024-03-26T05:48:02Z) - Prompting Diffusion Representations for Cross-Domain Semantic
Segmentation [101.04326113360342]
Diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z) - Soft Merging of Experts with Adaptive Routing [38.962451264172856]
We introduce Soft Merging of Experts with Adaptive Routing (SMEAR).
SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters.
We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation.
arXiv Detail & Related papers (2023-06-06T15:04:31Z) - Diversified Dynamic Routing for Vision Tasks [36.199659460868496]
We propose a novel architecture where each layer is composed of a set of experts.
In our method, the model is explicitly trained to solve the challenge of finding relevant partitioning of the data.
We conduct several experiments on semantic segmentation on Cityscapes and object detection and instance segmentation on MS-COCO.
arXiv Detail & Related papers (2022-09-26T23:27:51Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z) - An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z) - Bayesian Graph Contrastive Learning [55.36652660268726]
We propose a novel perspective on graph contrastive learning methods, showing that random augmentations lead to stochastic encoders.
Our proposed method represents each node by a distribution in the latent space in contrast to existing techniques which embed each node to a deterministic vector.
We show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-12-15T01:45:32Z) - Local Competition and Stochasticity for Adversarial Robustness in Deep
Learning [8.023314613846418]
This work addresses adversarial robustness in deep learning by considering deep networks with local winner-takes-all activations.
This type of network unit results in sparse representations from each model layer, as the units are organized in blocks where only one unit generates a non-zero output.
arXiv Detail & Related papers (2021-01-04T17:40:52Z) - Making Neural Networks Interpretable with Attribution: Application to
Implicit Signals Prediction [11.427019313283997]
We propose a novel formulation of interpretable deep neural networks for the attribution task.
Using masked weights, hidden features can be deeply attributed, split into several input-restricted sub-networks and trained as a boosted mixture of experts.
arXiv Detail & Related papers (2020-08-26T06:46:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.