MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
- URL: http://arxiv.org/abs/2407.09816v1
- Date: Sat, 13 Jul 2024 09:22:33 GMT
- Title: MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
- Authors: Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu
- Abstract summary: Mixture-of-Experts models (MoEs) allow model capacity to scale without substantially increasing training or inference costs.
We propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the MoE model.
- Score: 38.15244333975921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling model capacity enhances its capabilities but significantly increases computation. Mixture-of-Experts models (MoEs) address this by allowing model capacity to scale without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, while fixed routing mechanisms can mitigate this issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in both perplexity (PPL) and downstream tasks.
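The abstract does not spell out the routing-mask mechanism in detail, so the following is only a minimal PyTorch-style sketch of one plausible reading: a token-level top-1 MoE layer whose router logits are masked per vocabulary id, so that infrequent tokens are confined to a small fixed subset of experts (concentrating their training signal) while frequent tokens keep the full expert set. The class name `RoutingMaskMoE`, the frequency threshold, and the hash-based expert assignment are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutingMaskMoE(nn.Module):
    """Illustrative sketch (not the authors' code): a token-level MoE layer whose
    router logits are masked per vocabulary id, so rare tokens are confined to
    fewer experts while frequent tokens may still reach every expert."""

    def __init__(self, d_model, d_ff, num_experts, vocab_size,
                 token_freq, freq_threshold=100, visible_for_rare=1):
        super().__init__()
        self.num_experts = num_experts
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # Precompute a (vocab_size, num_experts) boolean routing mask:
        # frequent tokens may use every expert; each rare token is pinned to a
        # small fixed subset, chosen here by hashing its id (an assumption).
        mask = torch.ones(vocab_size, num_experts, dtype=torch.bool)
        rare = token_freq < freq_threshold                       # (vocab_size,)
        for tok_id in torch.nonzero(rare, as_tuple=False).flatten().tolist():
            allowed = torch.zeros(num_experts, dtype=torch.bool)
            start = tok_id % num_experts
            allowed[[(start + i) % num_experts
                     for i in range(visible_for_rare)]] = True
            mask[tok_id] = allowed
        self.register_buffer("routing_mask", mask)

    def forward(self, hidden, token_ids):
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        logits = self.router(hidden)                              # (b, s, E)
        allowed = self.routing_mask[token_ids]                    # (b, s, E)
        logits = logits.masked_fill(~allowed, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        top_p, top_e = probs.max(dim=-1)                          # top-1 routing
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            sel = top_e == e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(-1) * expert(hidden[sel])
        return out
```

In this reading, routing for rare tokens is fixed once from corpus statistics, which matches the abstract's claim of more comprehensive training for infrequent tokens while frequent tokens retain learned, diverse routing.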
Related papers
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts [41.80218225636109]
CuMo improves model scalability during training while keeping inference costs similar to those of smaller models.
CuMo incorporates sparsely-gated Mixture-of-Experts blocks into both the vision encoder and the connector.
The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.
arXiv Detail & Related papers (2024-05-09T17:37:20Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion [29.130355774088205]
FuseMoE is a mixture-of-experts framework that incorporates an innovative gating function.
Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories.
arXiv Detail & Related papers (2024-02-05T17:37:46Z)
- Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation [0.9618396291860722]
Mixture of Experts (MoE) models increase the parameter count of Transformer models while keeping training and inference costs roughly constant.
MoE models are prone to issues like training instability and uneven expert utilization.
We propose a fully-differentiable model that retains the benefits of MoE architectures while avoiding the aforementioned difficulties.
arXiv Detail & Related papers (2023-10-24T16:03:57Z)
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates heads' utilities, providing a new perspective for discussing the model's interpretability (a minimal routing sketch in this spirit follows the list below).
arXiv Detail & Related papers (2022-10-11T04:54:05Z)
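As a companion to the MoA entry above, here is a minimal, hedged sketch of per-token attention-head selection: a router scores a pool of heads, only the top-k heads are evaluated for each token, and their outputs are combined using the renormalized gate weights. The shared key/value projection and all names (`MixtureOfAttentionHeads`, `k`, etc.) are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfAttentionHeads(nn.Module):
    """Illustrative sketch (assumptions, not the MoA authors' code): each token's
    router picks the top-k heads from a larger pool; only selected heads are
    evaluated, and their outputs are mixed with the routing weights."""

    def __init__(self, d_model, num_heads, head_dim, k=2):
        super().__init__()
        self.k, self.num_heads, self.head_dim = k, num_heads, head_dim
        self.router = nn.Linear(d_model, num_heads, bias=False)
        self.q = nn.Linear(d_model, num_heads * head_dim)
        self.kv = nn.Linear(d_model, 2 * head_dim)  # shared key/value projection (assumption)
        self.out = nn.Linear(head_dim, d_model)

    def forward(self, x):
        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        gate = F.softmax(self.router(x), dim=-1)              # (b, s, H)
        topw, tope = gate.topk(self.k, dim=-1)                # (b, s, k)
        q = self.q(x).view(b, s, self.num_heads, self.head_dim)
        k_, v = self.kv(x).chunk(2, dim=-1)                   # (b, s, head_dim) each
        # Gather only the selected heads' queries for every token.
        idx = tope.unsqueeze(-1).expand(-1, -1, -1, self.head_dim)
        q_sel = q.gather(2, idx)                               # (b, s, k, head_dim)
        attn = torch.einsum("bskd,btd->bskt", q_sel, k_) / self.head_dim ** 0.5
        ctx = torch.einsum("bskt,btd->bskd", attn.softmax(dim=-1), v)
        # Weight each selected head's context by its renormalized gate value.
        w = topw / topw.sum(dim=-1, keepdim=True)
        return self.out((w.unsqueeze(-1) * ctx).sum(dim=2))    # (b, s, d_model)
```

Evaluating only k of the H heads per token is what keeps the per-token compute close to a standard k-head attention layer while the head pool, like an expert pool, can be scaled up.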