MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
- URL: http://arxiv.org/abs/2407.09816v4
- Date: Thu, 29 Aug 2024 08:45:58 GMT
- Title: MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
- Authors: Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu
- Abstract summary: MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training.
Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
- Score: 38.15244333975921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling the size of a model enhances its capabilities but significantly increases computation complexity. Mixture-of-Experts (MoE) models address the issue by allowing model size to scale up without substantially increasing training or inference costs. In MoE, there is an important module called the router, which is used to distribute each token to the experts. Currently, the mainstream routing methods include dynamic routing and fixed routing. Despite their promising results, MoE models encounter several challenges. Primarily, for dynamic routing methods, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, though fixed routing methods can mitigate that issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
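The abstract describes a router that dispatches each token to experts and a masking technique that restricts which experts a token may be routed to. Below is a minimal, illustrative sketch in PyTorch of what such a masked top-1 router could look like. The class name `MaskedTop1Router`, the lookup table `expert_mask_table`, and the random per-token expert subsets are assumptions made for illustration, not the paper's actual implementation, which might, for example, choose the number of visible experts based on token frequency.

```python
# Minimal sketch of a top-1 MoE router with a per-token routing mask.
# Names such as `expert_mask_table` are illustrative assumptions, not
# the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedTop1Router(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, vocab_size: int,
                 visible_experts_per_token: int = 1):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Fixed binary mask per vocabulary id: True means the expert is a
        # routing candidate for that token. Here the subset is random; a
        # frequency-aware scheme could assign wider subsets to frequent
        # tokens, in the spirit of the abstract's "routing mask".
        mask = torch.zeros(vocab_size, num_experts, dtype=torch.bool)
        for tok in range(vocab_size):
            visible = torch.randperm(num_experts)[:visible_experts_per_token]
            mask[tok, visible] = True
        self.register_buffer("expert_mask_table", mask)

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor):
        # hidden_states: (batch, seq, hidden); token_ids: (batch, seq)
        logits = self.gate(hidden_states)                  # (batch, seq, experts)
        mask = self.expert_mask_table[token_ids]           # (batch, seq, experts)
        logits = logits.masked_fill(~mask, float("-inf"))  # hide masked experts
        probs = F.softmax(logits, dim=-1)
        expert_index = probs.argmax(dim=-1)                # top-1 expert per token
        gate_weight = probs.gather(-1, expert_index.unsqueeze(-1)).squeeze(-1)
        return expert_index, gate_weight


# Example usage with hypothetical sizes.
router = MaskedTop1Router(hidden_dim=64, num_experts=8, vocab_size=1000)
hidden = torch.randn(2, 16, 64)
tokens = torch.randint(0, 1000, (2, 16))
expert_index, gate_weight = router(hidden, tokens)  # shapes: (2, 16), (2, 16)
```

In this sketch, dispatching hidden states to the selected experts and combining their outputs with the gate weights is left to the surrounding MoE layer; the sketch only shows how a per-token mask can narrow the router's candidate set while dynamic routing still operates within the visible experts.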
Related papers
- Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE).
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
- Merging Multi-Task Models via Weight-Ensembling Mixture of Experts [64.94129594112557]
Merging Transformer-based models trained on different tasks yields a single unified model that can execute all the tasks concurrently.
Previous methods, exemplified by task arithmetic, have been proven to be both effective and scalable.
We propose to merge most of the parameters while upscaling the Transformer layers to a weight-ensembling mixture of experts (MoE) module.
arXiv Detail & Related papers (2024-02-01T08:58:57Z)
- Mixture of Tokens: Continuous MoE through Cross-Example Aggregation [0.7880651741080428]
Mixture of Experts (MoE) models are pushing the boundaries of language and vision tasks.
Mixture of Tokens (MoT) is a simple, continuous architecture that is capable of scaling the number of parameters similarly to sparse MoE models.
Our best models achieve a 3x increase in training speed over dense Transformer models in language pretraining.
arXiv Detail & Related papers (2023-10-24T16:03:57Z)
- Domain Generalization via Balancing Training Difficulty and Model Capability [61.053202176230904]
Domain generalization (DG) aims to learn, from one or multiple source domains, models that perform well in unseen target domains.
Despite recent progress, most existing work suffers from a misalignment between the difficulty level of training samples and the capability of the model at its current stage of training.
We design MoDify, a Momentum Difficulty framework that tackles the misalignment by balancing the seesaw between the model's capability and the samples' difficulties.
arXiv Detail & Related papers (2023-09-02T07:09:23Z)
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z)
- KSM: Fast Multiple Task Adaption via Kernel-wise Soft Mask Learning [49.77278179376902]
Deep Neural Networks (DNNs) can forget knowledge about earlier tasks when learning new tasks, which is known as catastrophic forgetting.
Recent continual learning methods are capable of alleviating the catastrophic forgetting problem on toy-sized datasets.
We propose a new training method called Kernel-wise Soft Mask (KSM), which learns a kernel-wise hybrid binary and real-valued soft mask for each task.
arXiv Detail & Related papers (2020-09-11T21:48:39Z)