MoDE: A Mixture-of-Experts Model with Mutual Distillation among the
Experts
- URL: http://arxiv.org/abs/2402.00893v1
- Date: Wed, 31 Jan 2024 03:52:32 GMT
- Title: MoDE: A Mixture-of-Experts Model with Mutual Distillation among the
Experts
- Authors: Zhitian Xie, Yinger Zhang, Chenyi Zhuang, Qitao Shi, Zhining Liu,
Jinjie Gu, Guannan Zhang
- Abstract summary: We propose a method called Mixture-of-Distilled-Expert (MoDE).
MoDE applies moderate mutual distillation among experts to enable each expert to pick up more features learned by other experts.
- Score: 15.535613294871487
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The application of mixture-of-experts (MoE) is gaining popularity due to its
ability to improve a model's performance. In an MoE structure, the gate layer
plays a significant role in distinguishing and routing input features to
different experts, enabling each expert to specialize in its corresponding
sub-task. However, the gate's routing mechanism also gives rise to narrow
vision: each individual expert learns its allocated sub-task only from the
samples routed to it, which in turn limits the MoE's ability to further improve
its generalization. To address this, we propose a method called
Mixture-of-Distilled-Expert (MoDE), which applies moderate mutual distillation
among experts, enabling each expert to pick up features learned by the other
experts and to gain a more accurate perception of its originally allocated
sub-task. We conduct extensive experiments on tabular, NLP and CV datasets,
which demonstrate MoDE's effectiveness, universality and robustness.
Furthermore, through an innovative "expert probing" construction, we develop a
parallel study that experimentally explains why MoDE works: moderately
distilling knowledge improves each individual expert's test performance on its
assigned task, leading to an overall improvement in the MoE's performance.
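As a rough illustration of the idea, the sketch below adds a mutual distillation term to a standard gated MoE layer. It is a minimal PyTorch-style sketch based only on the abstract: the expert architecture, the dense softmax gate, the MSE-based distillation term, and the weight alpha are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoDELayer(nn.Module):
    """Gated MoE layer with a moderate mutual-distillation loss among experts (sketch)."""

    def __init__(self, d_in, d_out, num_experts=4, alpha=0.1):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_in, num_experts)  # routing network
        self.alpha = alpha                        # distillation strength, kept "moderate"

    def forward(self, x):
        gate_probs = F.softmax(self.gate(x), dim=-1)                    # (B, E)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)

        # Standard MoE output: gate-weighted combination of expert outputs.
        y = torch.einsum("be,bed->bd", gate_probs, expert_outs)

        # Mutual distillation: each expert is nudged toward the (detached) outputs
        # of the other experts, so it also benefits from samples the gate routed
        # elsewhere instead of seeing only its own allocation ("narrow vision").
        num_experts = expert_outs.size(1)
        distill = x.new_zeros(())
        for i in range(num_experts):
            for j in range(num_experts):
                if i != j:
                    distill = distill + F.mse_loss(expert_outs[:, i],
                                                   expert_outs[:, j].detach())
        distill = distill / (num_experts * (num_experts - 1))
        return y, self.alpha * distill


# Usage: the distillation term is simply added to the ordinary task loss.
layer = MoDELayer(d_in=32, d_out=16)
x, target = torch.randn(8, 32), torch.randn(8, 16)
y, distill_loss = layer(x)
loss = F.mse_loss(y, target) + distill_loss
loss.backward()
```

The per-expert outputs exposed here also make it straightforward to emulate the paper's "expert probing" style of analysis: each expert can be evaluated separately on the samples the gate routes to it.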
Related papers
- ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts [71.11994027685974]
We integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision.
We observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design.
To address this, we introduce a shared expert to learn and capture common information, serving as an effective way to construct stable ViMoE.
arXiv Detail & Related papers (2024-10-21T07:51:17Z)
- HMoE: Heterogeneous Mixture of Experts for Language Modeling [45.65121689677227]
Traditionally, Mixture of Experts (MoE) models use homogeneous experts, each with identical capacity.
We propose a novel Heterogeneous Mixture of Experts (HMoE) where experts differ in size and thus possess diverse capacities.
HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks.
arXiv Detail & Related papers (2024-08-20T09:35:24Z)
- HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou [19.113649341888532]
We present the practical problems and lessons learned from Kuaishou's short-video services.
In industry, a widely-used multi-task framework is the Mixture-of-Experts (MoE) paradigm.
arXiv Detail & Related papers (2024-08-10T04:25:48Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts [25.504602853436047]
Mixture of Experts (MoE) for language models has been proven effective in augmenting the capacity of models by dynamically routing each input token to a specific subset of experts for processing.
We propose HyperMoE, a novel MoE framework built upon Hypernetworks.
This framework integrates the computational processes of MoE with the concept of knowledge transferring in multi-task learning.
arXiv Detail & Related papers (2024-02-20T02:09:55Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts.
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with an outrageously large parameter count suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.