Improving Expert Specialization in Mixture of Experts
- URL: http://arxiv.org/abs/2302.14703v1
- Date: Tue, 28 Feb 2023 16:16:45 GMT
- Title: Improving Expert Specialization in Mixture of Experts
- Authors: Yamuna Krishnamurthy and Chris Watkins and Thomas Gaertner
- Abstract summary: Mixture of experts (MoE) is the simplest gated modular neural network architecture.
We show that the original MoE architecture and its training method do not guarantee intuitive task decompositions and good expert utilization.
We introduce a novel gating architecture, similar to attention, that improves performance and results in a lower entropy task decomposition.
- Score: 0.7366405857677227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of experts (MoE), introduced over 20 years ago, is the simplest gated
modular neural network architecture. There is renewed interest in MoE because
the conditional computation allows only parts of the network to be used during
each inference, as was recently demonstrated in large scale natural language
processing models. MoE is also of potential interest for continual learning, as
experts may be reused for new tasks, and new experts introduced. The gate in
the MoE architecture learns task decompositions and individual experts learn
simpler functions appropriate to the gate's decomposition. In this paper: (1)
we show that the original MoE architecture and its training method do not
guarantee intuitive task decompositions and good expert utilization, indeed
they can fail spectacularly even for simple data such as MNIST and
FashionMNIST; (2) we introduce a novel gating architecture, similar to
attention, that improves performance and results in a lower entropy task
decomposition; and (3) we introduce a novel data-driven regularization that
improves expert specialization. We empirically validate our methods on MNIST,
FashionMNIST and CIFAR-100 datasets.
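As a point of reference for the abstract above, here is a minimal sketch of the classic densely gated MoE layer it analyzes: a linear softmax gate scores the experts for each input, and the layer output is the gate-weighted sum of the expert outputs. The paper's attention-like gate and data-driven regularizer are not specified in the abstract, so this sketch shows only the standard baseline; the class name, sizes, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassicMoE(nn.Module):
    """Minimal densely gated mixture-of-experts layer (illustrative sketch only).

    The gate learns a soft task decomposition (a distribution over experts per
    input) and each expert learns a simpler function; the layer output is the
    gate-weighted sum of the expert outputs. This is the original MoE baseline
    discussed in the abstract, not the paper's attention-like gate.
    """

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(in_dim, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, out_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_probs = F.softmax(self.gate(x), dim=-1)                    # (batch, experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, experts, out_dim)
        return torch.einsum("be,beo->bo", gate_probs, expert_outs)      # gate-weighted mixture

# Toy usage on MNIST-sized inputs (784 features, 10 classes), 4 experts.
layer = ClassicMoE(in_dim=784, hidden_dim=128, out_dim=10, num_experts=4)
print(layer(torch.randn(32, 784)).shape)  # torch.Size([32, 10])
```

With a sparse (e.g. top-k) gate instead of the dense softmax, only the selected experts would be evaluated per input, which is the conditional computation the abstract refers to.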
Related papers
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change of the router's l2 norm from the pretrained model guarantees the preservation of test accuracy; a hedged sketch of this criterion appears after this list.
Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss; a hedged sketch of the counting step appears after this list.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts [15.535613294871487]
We propose a method called Mixture-of-Distilled-Expert (MoDE).
MoDE applies moderate mutual distillation among experts to enable each expert to pick up more features learned by other experts; a hedged sketch of one such distillation loss appears after this list.
arXiv Detail & Related papers (2024-01-31T03:52:32Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE models with very large parameter counts suffer from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments with GPT2 on three well-known downstream natural language datasets show improved performance and efficiency when increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
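The pruning criterion summarized under "A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts" above admits a short sketch: measure, for each expert, the l2 norm of the change in its router (gate) weights between the pretrained and the fine-tuned model, and prune the experts whose router weights changed the least. The helper below is a hedged reconstruction from that one-sentence summary, not the paper's implementation; the routers are assumed to be weight matrices with one row per expert.

```python
import torch

def prune_least_moved_experts(pretrained_router: torch.Tensor,
                              finetuned_router: torch.Tensor,
                              num_to_prune: int) -> list[int]:
    """Return indices of experts to prune (illustrative sketch only).

    Experts whose router weights changed the least (smallest l2 norm of the
    difference between fine-tuned and pretrained gate rows) are pruned first,
    following the criterion described in the summary above.
    """
    change = torch.linalg.norm(finetuned_router - pretrained_router, dim=1)  # one norm per expert
    return torch.argsort(change)[:num_to_prune].tolist()                     # smallest change first

# Toy usage: 8 experts routed from a 16-dimensional hidden state, prune 4.
pre = torch.randn(8, 16)
post = pre + 0.1 * torch.randn(8, 16)
print(prune_least_moved_experts(pre, post, num_to_prune=4))
```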
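SEER-MoE's first stage, as summarized above, prunes experts with heavy-hitters counting guidance. One plausible reading, sketched below under that assumption, is to count how often the router places each expert in its top-k choices over a calibration batch and keep only the most frequently selected experts; the function is hypothetical and not the paper's code.

```python
import torch

def heavy_hitter_experts(router_logits: torch.Tensor, top_k: int, num_to_keep: int) -> list[int]:
    """Count how often each expert appears in the router's top-k choices over a
    calibration batch and return the most used experts (illustrative sketch only)."""
    num_experts = router_logits.shape[-1]
    topk_idx = router_logits.topk(top_k, dim=-1).indices.flatten()  # selected expert ids per token
    counts = torch.bincount(topk_idx, minlength=num_experts)        # heavy-hitters count per expert
    return counts.argsort(descending=True)[:num_to_keep].tolist()

# Toy usage: router logits for 1000 tokens over 16 experts, top-2 routing, keep 8 experts.
logits = torch.randn(1000, 16)
print(heavy_hitter_experts(logits, top_k=2, num_to_keep=8))
```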
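MoDE's mutual distillation among experts, as summarized above, can likewise be sketched. One common way to realize such a loss, shown here as an assumption rather than the paper's exact formulation, is to add a moderate KL penalty that pulls each expert's predictive distribution towards the (detached) average distribution of the other experts.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(expert_logits: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """Moderate mutual distillation among experts (illustrative sketch only).

    expert_logits: (num_experts, batch, num_classes). Each expert's predictive
    distribution is nudged towards the detached mean distribution of the other
    experts, so experts share learned features without collapsing into one model.
    """
    num_experts = expert_logits.shape[0]
    probs = F.softmax(expert_logits, dim=-1)
    loss = expert_logits.new_zeros(())
    for i in range(num_experts):
        teachers = torch.cat([probs[:i], probs[i + 1:]], dim=0).mean(dim=0).detach()
        log_p = F.log_softmax(expert_logits[i], dim=-1)
        loss = loss + F.kl_div(log_p, teachers, reduction="batchmean")
    return weight * loss / num_experts

# Toy usage: 4 experts, batch of 32, 10 classes.
print(mutual_distillation_loss(torch.randn(4, 32, 10)))
```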
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.