Related papers: On the Representation Collapse of Sparse Mixture of Experts

On the Representation Collapse of Sparse Mixture of Experts

URL: http://arxiv.org/abs/2204.09179v1
Date: Wed, 20 Apr 2022 01:40:19 GMT
Title: On the Representation Collapse of Sparse Mixture of Experts
Authors: Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Furu Wei
Abstract summary: Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
Score: 102.83396489230375
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.

Related papers

RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging [69.2230254959204]
We propose RouteMark, a framework for IP protection in merged MoE models.<n>Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs.<n>For attribution and tampering detection, we introduce a similarity-based matching algorithm.
arXiv Detail & Related papers (2025-08-03T14:51:58Z)
Load Balancing Mixture of Experts with Similarity Preserving Routers [37.348178220494226]
Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks.<n>We introduce a novel load balancing loss that preserves token-wise relational structure.<n>Our results show that applying our loss to the router results in 36% faster convergence and lower redundancy.
arXiv Detail & Related papers (2025-06-16T22:22:59Z)
Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts [33.39800923804871]
We introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens.
arXiv Detail & Related papers (2025-03-20T11:45:08Z)
On the effectiveness of discrete representations in sparse mixture of experts [33.809432499123275]
We propose a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE) VQMoE is an effective solution for scaling up model capacity without increasing the computational costs. We show that VQMoE achieves a 28% improvement in routers compared to other SMoE routing methods.
arXiv Detail & Related papers (2024-11-28T22:32:01Z)
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection [63.96018203905272]
We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts. We demonstrate the effectiveness of our method, DiffPruning, across several datasets.
arXiv Detail & Related papers (2024-09-23T21:27:26Z)
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
READ: Improving Relation Extraction from an ADversarial Perspective [33.44949503459933]
We propose an adversarial training method specifically designed for relation extraction (RE) Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations.
arXiv Detail & Related papers (2024-04-02T16:42:44Z)
Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
diffusion-pretraining achieves extraordinary domain generalization results for semantic segmentation. We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
Diversified Dynamic Routing for Vision Tasks [36.199659460868496]
We propose a novel architecture where each layer is composed of a set of experts. In our method, the model is explicitly trained to solve the challenge of finding relevant partitioning of the data. We conduct several experiments on semantic segmentation on Cityscapes and object detection and instance segmentation on MS-COCO.
arXiv Detail & Related papers (2022-09-26T23:27:51Z)
MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated. However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system. Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches. This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z)
Local Competition and Stochasticity for Adversarial Robustness in Deep Learning [8.023314613846418]
This work addresses adversarial robustness in deep learning by considering deep networks with local winner-takes-all activations. This type of network units result in sparse representations from each model layer, as the units are organized in blocks where only one unit generates a non-zero output.
arXiv Detail & Related papers (2021-01-04T17:40:52Z)
Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction [11.427019313283997]
We propose a novel formulation of interpretable deep neural networks for the attribution task. Using masked weights, hidden features can be deeply attributed, split into several input-restricted sub-networks and trained as a boosted mixture of experts.
arXiv Detail & Related papers (2020-08-26T06:46:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.