Towards Understanding Mixture of Experts in Deep Learning
- URL: http://arxiv.org/abs/2208.02813v1
- Date: Thu, 4 Aug 2022 17:59:10 GMT
- Title: Towards Understanding Mixture of Experts in Deep Learning
- Authors: Zixiang Chen and Yihe Deng and Yue Wu and Quanquan Gu and Yuanzhi Li
- Abstract summary: We study how the MoE layer improves the performance of neural network learning.
Our results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE.
- Score: 95.27215939891511
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by
a router, has achieved great success in deep learning. However, the
understanding of such architecture remains elusive. In this paper, we formally
study how the MoE layer improves the performance of neural network learning and
why the mixture model will not collapse into a single model. Our empirical
results suggest that the cluster structure of the underlying problem and the
non-linearity of the expert are pivotal to the success of MoE. To further
understand this, we consider a challenging classification problem with
intrinsic cluster structures, which is hard to learn using a single expert. Yet
with the MoE layer, by choosing the experts as two-layer nonlinear
convolutional neural networks (CNNs), we show that the problem can be learned
successfully. Furthermore, our theory shows that the router can learn the
cluster-center features, which helps divide the complex input problem into
simpler linear classification sub-problems that individual experts can conquer.
To our knowledge, this is the first result towards formally understanding the
mechanism of the MoE layer for deep learning.
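To make the described architecture concrete, below is a minimal PyTorch sketch of a sparsely-activated MoE layer with a linear router and two-layer nonlinear CNN experts. It is an illustrative sketch, not the paper's exact construction: the number of experts, expert width, 8x8 input shape, and top-1 routing rule are all assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# a sparsely-activated MoE layer whose experts are two-layer nonlinear CNNs
# and whose linear router activates one expert per example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLayerCNNExpert(nn.Module):
    """Two-layer convolutional expert with a nonlinear activation."""

    def __init__(self, in_channels: int = 3, hidden: int = 16, num_classes: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.conv1(x))   # first nonlinear layer
        h = F.relu(self.conv2(h))   # second nonlinear layer
        h = h.mean(dim=(2, 3))      # global average pooling
        return self.head(h)


class SparseMoELayer(nn.Module):
    """Router + experts; only the top-1 expert is activated per example."""

    def __init__(self, num_experts: int = 4, in_channels: int = 3, num_classes: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            TwoLayerCNNExpert(in_channels, num_classes=num_classes)
            for _ in range(num_experts)
        )
        # Linear router over the flattened input (assumes 8x8 inputs here).
        # Intuitively, it can learn cluster-center directions and dispatch
        # each input to the expert responsible for that cluster.
        self.router = nn.Linear(in_channels * 8 * 8, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_logits = self.router(x.flatten(1))            # (batch, num_experts)
        gate_probs = F.softmax(gate_logits, dim=-1)
        top1 = gate_probs.argmax(dim=-1)                   # chosen expert per example
        out = torch.zeros(x.size(0), self.experts[0].head.out_features, device=x.device)
        for k, expert in enumerate(self.experts):
            mask = top1 == k
            if mask.any():
                # Weight the expert output by its gate value so the router
                # still receives gradients despite the hard top-1 selection.
                out[mask] = gate_probs[mask, k:k + 1] * expert(x[mask])
        return out


if __name__ == "__main__":
    moe = SparseMoELayer()
    images = torch.randn(8, 3, 8, 8)   # toy batch of 8x8 RGB inputs
    logits = moe(images)
    print(logits.shape)                # torch.Size([8, 2])
```

The hard top-1 dispatch mirrors the sparse activation the abstract refers to; weighting each selected expert's output by its gate probability is one common way to keep the router trainable.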
Related papers
- Theory on Mixture-of-Experts in Continual Learning [72.42497633220547]
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time.
Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks.
The MoE model has recently been shown to effectively mitigate catastrophic forgetting in CL by employing a gating network.
arXiv Detail & Related papers (2024-06-24T08:29:58Z) - Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers; a minimal layer-averaging sketch is given after this list.
arXiv Detail & Related papers (2023-07-13T09:39:10Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE models with an outrageously large number of parameters suffer from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z) - Stacked unsupervised learning with a network architecture found by
supervised meta-learning [4.209801809583906]
Stacked unsupervised learning (SUL) seems more biologically plausible than backpropagation.
But SUL has fallen far short of backpropagation in practical applications.
We show an SUL algorithm that can perform completely unsupervised clustering of MNIST digits.
arXiv Detail & Related papers (2022-06-06T16:17:20Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments with GPT2 on three well-known downstream natural language datasets show improved performance and efficiency when increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z) - Mixture of ELM based experts with trainable gating network [2.320417845168326]
We propose an ensemble learning method based on a mixture of experts (ME).
The ME structure consists of multilayer perceptrons (MLPs) as base experts and a gating network.
In the proposed method, a trainable gating network is applied to aggregate the outputs of the experts.
arXiv Detail & Related papers (2021-05-25T07:13:35Z) - Pseudo-supervised Deep Subspace Clustering [27.139553299302754]
Auto-Encoder (AE)-based deep subspace clustering (DSC) methods have achieved impressive performance.
However, the self-reconstruction loss of an AE ignores rich and useful relational information.
It is also challenging to learn high-level similarity without feeding semantic labels.
arXiv Detail & Related papers (2021-04-08T06:25:47Z) - Gradient-based Competitive Learning: Theory [1.6752712949948443]
This paper introduces a novel perspective by combining gradient-based and competitive learning.
The theory is based on the intuition that neural networks are able to learn topological structures by working directly on the transpose of the input matrix.
The proposed approach has a great potential as it can be generalized to a vast selection of topological learning tasks.
arXiv Detail & Related papers (2020-09-06T19:00:51Z) - Understanding Deep Architectures with Reasoning Layer [60.90906477693774]
We show that properties of the algorithm layers, such as convergence, stability, and sensitivity, are intimately related to the approximation and generalization abilities of the end-to-end model.
Our theory can provide useful guidelines for designing deep architectures with reasoning layers.
arXiv Detail & Related papers (2020-06-24T00:26:35Z)
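The layer-wise parameter averaging discussed in the Layer-wise Linear Mode Connectivity entry above can be illustrated with the sketch below. It is a minimal example assuming two models that share an architecture; the layer names, the prefix-based selection rule, and the 50/50 averaging weight are hypothetical choices, not details taken from that paper.

```python
# Minimal sketch (hedged assumptions): average the parameters of a chosen
# subset of layers from two same-architecture models, keeping the remaining
# layers from the first model.
import torch
import torch.nn as nn


def average_layers(model_a: nn.Module, model_b: nn.Module, prefixes: tuple) -> nn.Module:
    """Return a copy of model_a whose selected layers are the mean of a and b."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged = {}
    for name, tensor_a in state_a.items():
        if name.startswith(prefixes):
            merged[name] = 0.5 * (tensor_a + state_b[name])  # element-wise average
        else:
            merged[name] = tensor_a.clone()                  # keep model_a's weights
    fused = type(model_a)()  # assumes a no-argument constructor
    fused.load_state_dict(merged)
    return fused


if __name__ == "__main__":
    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.feature = nn.Linear(4, 8)
            self.head = nn.Linear(8, 2)

        def forward(self, x):
            return self.head(torch.relu(self.feature(x)))

    a, b = TinyNet(), TinyNet()
    # Average only the "feature" layer; keep model a's classification head.
    fused = average_layers(a, b, prefixes=("feature",))
    print(fused(torch.randn(1, 4)).shape)  # torch.Size([1, 2])
```

Varying which prefixes are averaged (single layers versus groups of layers) is one simple way to probe how different parts of the network contribute to the fused model's behaviour.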