Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient
MoE for Instruction Tuning
- URL: http://arxiv.org/abs/2309.05444v1
- Date: Mon, 11 Sep 2023 13:31:00 GMT
- Title: Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient
MoE for Instruction Tuning
- Authors: Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr
Locatelli, Sara Hooker
- Abstract summary: We propose an extremely parameter-efficient MoE by combining MoE architecture with lightweight experts.
Our method generalizes to unseen tasks as it does not depend on any prior task knowledge.
Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints.
- Score: 7.094820944028638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Mixture of Experts (MoE) is a widely known neural architecture where an
ensemble of specialized sub-models optimizes overall performance with a
constant computational cost. However, conventional MoEs pose challenges at
scale due to the need to store all experts in memory. In this paper, we push
MoE to the limit. We propose an extremely parameter-efficient MoE by uniquely
combining MoE architecture with lightweight experts. Our MoE architecture
outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on
par with full fine-tuning by only updating the lightweight experts -- less than
1% of an 11B-parameter model. Furthermore, our method generalizes to unseen
tasks as it does not depend on any prior task knowledge. Our research
underscores the versatility of the mixture of experts architecture, showcasing
its ability to deliver robust performance even when subjected to rigorous
parameter constraints. Our code used in all the experiments is publicly
available here: https://github.com/for-ai/parameter-efficient-moe.
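The abstract does not say what form the lightweight experts take, so the following is only a minimal sketch of the general idea: a frozen dense layer whose output is adjusted by a softmax-weighted mixture of small trainable experts. The low-rank (LoRA-style) expert shape, the dense softmax router, and every name in the code (`LightweightMoELayer`, `trainable_fraction`, and so on) are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LightweightMoELayer:
    """Hypothetical layer: frozen dense weights plus a soft mixture of low-rank experts."""

    def __init__(self, d_in, d_out, n_experts, rank, seed=0):
        rng = random.Random(seed)
        # Stand-in for pretrained weights; these would stay frozen during tuning.
        self.frozen_w = rand_matrix(d_out, d_in, rng)
        # Router and experts are the only parameters that would be trained.
        self.router_w = rand_matrix(n_experts, d_in, rng)
        self.experts = [
            (rand_matrix(rank, d_in, rng), rand_matrix(d_out, rank, rng))
            for _ in range(n_experts)
        ]

    def forward(self, x):
        # Assumed dense routing: every expert contributes, weighted by its gate.
        gates = softmax(matvec(self.router_w, x))
        out = matvec(self.frozen_w, x)
        for gate, (down, up) in zip(gates, self.experts):
            delta = matvec(up, matvec(down, x))
            out = [o + gate * d for o, d in zip(out, delta)]
        return out, gates

    def trainable_fraction(self):
        """Share of this layer's parameters held by the router and experts."""
        d_out, d_in = len(self.frozen_w), len(self.frozen_w[0])
        n_experts, rank = len(self.experts), len(self.experts[0][0])
        trainable = n_experts * d_in + n_experts * rank * (d_in + d_out)
        return trainable / (trainable + d_out * d_in)


layer = LightweightMoELayer(d_in=256, d_out=256, n_experts=4, rank=1)
y, gates = layer.forward([1.0] * 256)
fraction = layer.trainable_fraction()
```

In this toy layer the trainable share shrinks as the frozen layer grows, since the expert and router sizes scale linearly with the layer width while the dense weights scale quadratically. The number it produces is not comparable to the less-than-1% figure the abstract reports for the 11B model.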
Related papers
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate our method by pruning two state-of-the-art MoE models, Mixtral-8x7B and Mixtral-8x22B.
Our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models [24.915387910764082]
Expert-Specialized Fine-Tuning, or ESFT, tunes the experts most relevant to downstream tasks while freezing the other experts and modules.
MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks.
arXiv Detail & Related papers (2024-07-02T03:11:13Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of experts whose router l2 norm changes less from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- Higher Layers Need More LoRA Experts [23.72297945365351]
We introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models.
Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines.
arXiv Detail & Related papers (2024-02-13T16:04:21Z)
- Efficient Deweather Mixture-of-Experts with Uncertainty-aware Feature-wise Linear Modulation [44.43376913419967]
We propose an efficient Mixture-of-Experts (MoE) architecture with weight sharing across experts.
MoFME implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block.
Experiments show that our MoFME outperforms the baselines in the image restoration quality by 0.1-0.2 dB.
arXiv Detail & Related papers (2023-12-27T15:23:37Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low rank experts.
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
However, MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)