MoDE: A Mixture-of-Experts Model with Mutual Distillation among the
Experts
- URL: http://arxiv.org/abs/2402.00893v1
- Date: Wed, 31 Jan 2024 03:52:32 GMT
- Title: MoDE: A Mixture-of-Experts Model with Mutual Distillation among the
Experts
- Authors: Zhitian Xie, Yinger Zhang, Chenyi Zhuang, Qitao Shi, Zhining Liu,
Jinjie Gu, Guannan Zhang
- Abstract summary: We propose a method called Mixture-of-Distilled-Expert (MoDE).
MoDE applies moderate mutual distillation among experts to enable each expert to pick up more features learned by other experts.
- Score: 15.535613294871487
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The application of mixture-of-experts (MoE) is gaining popularity due to its
ability to improve a model's performance. In an MoE structure, the gate layer
plays a significant role in distinguishing and routing input features to
different experts. This enables each expert to specialize in processing its
corresponding sub-task. However, the gate's routing mechanism also gives rise
to a narrow-vision problem: each individual expert fails to make use of a broader
set of samples when learning its allocated sub-task, which in turn limits the
MoE's ability to further improve its generalization. To effectively address this,
we propose a method called Mixture-of-Distilled-Expert (MoDE), which applies
moderate mutual distillation among experts, enabling each expert to pick up
features learned by the other experts and gain a more accurate perception of its
originally allocated sub-task. We conduct extensive experiments on tabular, NLP
and CV datasets, which show MoDE's effectiveness, universality and robustness.
Furthermore, we develop a parallel study by constructing an "expert probing"
procedure to experimentally show why MoDE works: moderately distilling knowledge
improves each individual expert's test performance on its assigned task, leading
to an improvement in the MoE's overall performance.
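The abstract describes the mechanism only at a high level. Below is a minimal, illustrative sketch of the idea: a gated MoE layer whose experts are additionally pulled toward each other's outputs by a mutual distillation term. The class name `MoDELayerSketch`, the coefficient `alpha`, the softmax (dense) gate, and the MSE distillation distance are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of mutual distillation among MoE experts.
# Assumptions (not taken from the paper): a softmax gate over all experts,
# MSE as the distillation distance, and the coefficient name `alpha`.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoDELayerSketch(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_experts: int = 4, alpha: float = 0.1):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(n_experts)])
        self.gate = nn.Linear(d_in, n_experts)
        self.alpha = alpha  # strength of the mutual distillation term

    def forward(self, x: torch.Tensor):
        gate_w = F.softmax(self.gate(x), dim=-1)                  # (B, E) gate weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, d_out)
        mixture = (gate_w.unsqueeze(-1) * outs).sum(dim=1)        # (B, d_out) MoE output

        # Moderate mutual distillation: pull each expert's output toward the
        # (detached) outputs of the other experts, so knowledge crosses the
        # gate's partition of the data.
        distill = 0.0
        n = len(self.experts)
        for i in range(n):
            for j in range(n):
                if i != j:
                    distill = distill + F.mse_loss(outs[:, i], outs[:, j].detach())
        distill = distill / (n * (n - 1))
        return mixture, self.alpha * distill


# Usage: add the distillation term to the ordinary task loss.
layer = MoDELayerSketch(d_in=16, d_out=8)
x, y = torch.randn(32, 16), torch.randn(32, 8)
pred, distill_loss = layer(x)
loss = F.mse_loss(pred, y) + distill_loss
loss.backward()
```

The paper's "expert probing" study would correspond to evaluating each expert's output individually on the samples the gate routes to it; the sketch above only shows the training-time objective.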
Related papers
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate our method by pruning two state-of-the-art MoE models, Mixtral-8x7B and Mixtral-8x22B.
Our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models [24.915387910764082]
Expert-Specialized Fine-Tuning, or ESFT, tunes the experts most relevant to downstream tasks while freezing the other experts and modules.
MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks.
arXiv Detail & Related papers (2024-07-02T03:11:13Z)
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts [25.504602853436047]
Mixture of Experts (MoE) for language models has been proven effective in augmenting the capacity of models by dynamically routing each input token to a specific subset of experts for processing.
We propose HyperMoE, a novel MoE framework built upon Hypernetworks.
This framework integrates the computational processes of MoE with the concept of knowledge transferring in multi-task learning.
arXiv Detail & Related papers (2024-02-20T02:09:55Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
Generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts.
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [62.41501243027603]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose a straightforward yet highly effective solution: OMoE, an expert entity.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
- Improving Expert Specialization in Mixture of Experts [0.7366405857677227]
Mixture of experts (MoE) is the simplest gated modular neural network architecture.
We show that the original MoE architecture and its training method do not guarantee intuitive task decompositions and good expert utilization.
We introduce a novel gating architecture, similar to attention, that improves performance and results in a lower entropy task decomposition.
arXiv Detail & Related papers (2023-02-28T16:16:45Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, an MoE with an outrageously large number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
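Several of the related papers above rely on the same basic mechanism: dense layers are replaced by sparse experts, and a gated routing network conditionally activates only a few of them per input. The snippet below is a minimal sketch of that top-k conditional activation; the class name, the k=2 default, and the use of plain linear experts are illustrative assumptions, not details from any of the listed papers.

```python
# Illustrative sketch of sparse top-k expert routing ("conditional activation").
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseTopKMoESketch(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                                # (B, E) routing scores
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)    # keep k experts per input
        weights = F.softmax(topk_vals, dim=-1)               # renormalise over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            # Only the selected experts are evaluated for each input.
            for e in idx.unique():
                mask = idx == e
                out[mask] += w[mask] * self.experts[int(e)](x[mask])
        return out
```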