On the Representation Collapse of Sparse Mixture of Experts
- URL: http://arxiv.org/abs/2204.09179v1
- Date: Wed, 20 Apr 2022 01:40:19 GMT
- Title: On the Representation Collapse of Sparse Mixture of Experts
- Authors: Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra,
Saksham Singhal, Payal Bajaj, Xia Song, Furu Wei
- Abstract summary: Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse mixture of experts provides larger model capacity while requiring a
constant computational overhead. It employs the routing mechanism to distribute
input tokens to the best-matched experts according to their hidden
representations. However, learning such a routing mechanism encourages token
clustering around expert centroids, implying a trend toward representation
collapse. In this work, we propose to estimate the routing scores between
tokens and experts on a low-dimensional hypersphere. We conduct extensive
experiments on cross-lingual language model pre-training and fine-tuning on
downstream tasks. Experimental results across seven multilingual benchmarks
show that our method achieves consistent gains. We also present a comprehensive
analysis on the representation and routing behaviors of our models. Our method
alleviates the representation collapse issue and achieves more consistent
routing than the baseline mixture-of-experts methods.
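The routing described in the abstract can be made concrete with a small sketch: project each token's hidden state into a low-dimensional routing space, keep one learnable embedding per expert in that space, normalize both to unit length, and score tokens against experts by scaled cosine similarity, i.e. on a hypersphere. The snippet below is an illustrative reading of the abstract only, not the authors' released implementation; the routing dimension, the temperature parameterization, and all module names are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereRouter(nn.Module):
    """Illustrative router: tokens and expert embeddings are compared on a
    low-dimensional unit hypersphere (cosine similarity with a learnable scale)."""

    def __init__(self, hidden_dim: int, num_experts: int, routing_dim: int = 8):
        super().__init__()
        # Project token representations into a low-dimensional routing space.
        self.proj = nn.Linear(hidden_dim, routing_dim, bias=False)
        # One learnable embedding per expert in that routing space.
        self.expert_emb = nn.Parameter(torch.randn(num_experts, routing_dim))
        # Learnable (log-)temperature controlling the sharpness of routing scores.
        self.log_temp = nn.Parameter(torch.zeros(()))

    def forward(self, hidden: torch.Tensor):
        # hidden: (num_tokens, hidden_dim)
        tokens = F.normalize(self.proj(hidden), dim=-1)      # unit-norm token projections
        experts = F.normalize(self.expert_emb, dim=-1)       # unit-norm expert embeddings
        scores = tokens @ experts.t() / self.log_temp.exp()  # scaled cosine similarity
        top1 = scores.argmax(dim=-1)                         # best-matched expert per token
        gates = F.softmax(scores, dim=-1)                    # soft routing probabilities
        return top1, gates
```
Because both sides are unit-normalized, the routing scores stay bounded no matter how large the hidden representations grow, which is one plausible reading of why the abstract ties hypersphere routing to alleviating representation collapse.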
Related papers
- Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection [63.96018203905272]
We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts.
We demonstrate the effectiveness of our method, DiffPruning, across several datasets.
arXiv Detail & Related papers (2024-09-23T21:27:26Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- READ: Improving Relation Extraction from an ADversarial Perspective [33.44949503459933]
We propose an adversarial training method specifically designed for relation extraction (RE).
Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations.
arXiv Detail & Related papers (2024-04-02T16:42:44Z)
- Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
Diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
- Diversified Dynamic Routing for Vision Tasks [36.199659460868496]
We propose a novel architecture where each layer is composed of a set of experts.
In our method, the model is explicitly trained to find a relevant partitioning of the data.
We conduct experiments on semantic segmentation on Cityscapes, as well as object detection and instance segmentation on MS-COCO.
arXiv Detail & Related papers (2022-09-26T23:27:51Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts and uses a gated routing network to activate experts conditionally (a generic sketch of this kind of top-k gating appears after this list).
However, as the number of experts grows, MoE models with enormous parameter counts suffer from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z)
- Local Competition and Stochasticity for Adversarial Robustness in Deep Learning [8.023314613846418]
This work addresses adversarial robustness in deep learning by considering deep networks with local winner-takes-all activations.
These units produce sparse representations at each model layer, as they are organized in blocks where only one unit generates a non-zero output (a simplified sketch of such a block appears after this list).
arXiv Detail & Related papers (2021-01-04T17:40:52Z)
- Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction [11.427019313283997]
We propose a novel formulation of interpretable deep neural networks for the attribution task.
Using masked weights, hidden features can be deeply attributed, split into several input-restricted sub-networks and trained as a boosted mixture of experts.
arXiv Detail & Related papers (2020-08-26T06:46:49Z)
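For contrast with the hypersphere routing sketched after the abstract, the gated routing mentioned in the MoEC entry above is conventionally a linear scorer followed by a softmax over the top-k experts chosen for each token. The snippet below is a generic illustration of that baseline pattern under assumed shapes and an assumed k, not code from any of the listed papers.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Illustrative baseline MoE gate: a linear scorer whose top-k experts per
    token are kept and renormalized, so experts are conditionally activated."""

    def __init__(self, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_experts)
        self.k = k

    def forward(self, hidden: torch.Tensor):
        # hidden: (num_tokens, hidden_dim)
        logits = self.scorer(hidden)                       # (num_tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # keep the k best-matched experts
        gates = F.softmax(topk_vals, dim=-1)               # renormalize over selected experts
        return topk_idx, gates                             # experts to run and their weights
```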
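The local winner-takes-all units mentioned in the adversarial-robustness entry can likewise be illustrated in a few lines: features are grouped into fixed-size blocks and only the strongest unit in each block keeps its value, giving the one-non-zero-per-block representation the summary describes. This deterministic sketch omits the stochastic competition referenced in that paper's title; the block size and class name are assumptions.
```python
import torch
import torch.nn as nn

class LocalWinnerTakesAll(nn.Module):
    """Illustrative (deterministic) LWTA activation: within each block of
    units, only the largest one passes its value through."""

    def __init__(self, block_size: int = 4):
        super().__init__()
        self.block_size = block_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features); features must be divisible by block_size.
        batch, features = x.shape
        blocks = x.view(batch, features // self.block_size, self.block_size)
        winners = blocks.argmax(dim=-1, keepdim=True)               # index of each block's winner
        mask = torch.zeros_like(blocks).scatter_(-1, winners, 1.0)  # one-hot mask per block
        return (blocks * mask).view(batch, features)                # one non-zero output per block
```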