Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient
MoE for Instruction Tuning
- URL: http://arxiv.org/abs/2309.05444v1
- Date: Mon, 11 Sep 2023 13:31:00 GMT
- Title: Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient
MoE for Instruction Tuning
- Authors: Ted Zadouri, Ahmet \"Ust\"un, Arash Ahmadian, Beyza Ermi\c{s}, Acyr
Locatelli, Sara Hooker
- Abstract summary: We propose an extremely parameter-efficient MoE by combining MoE architecture with lightweight experts.
Our method generalizes to unseen tasks as it does not depend on any prior task knowledge.
Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints.
- Score: 7.094820944028638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Mixture of Experts (MoE) is a widely known neural architecture where an
ensemble of specialized sub-models optimizes overall performance with a
constant computational cost. However, conventional MoEs pose challenges at
scale due to the need to store all experts in memory. In this paper, we push
MoE to the limit. We propose extremely parameter-efficient MoE by uniquely
combining MoE architecture with lightweight experts.Our MoE architecture
outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on
par with full fine-tuning by only updating the lightweight experts -- less than
1% of an 11B parameters model. Furthermore, our method generalizes to unseen
tasks as it does not depend on any prior task knowledge. Our research
underscores the versatility of the mixture of experts architecture, showcasing
its ability to deliver robust performance even when subjected to rigorous
parameter constraints. Our code used in all the experiments is publicly
available here: https://github.com/for-ai/parameter-efficient-moe.
Related papers
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network(FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z) - BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts [41.83123857437985]
Training MoEs from scratch in a large-scale regime is prohibitively expensive.
We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming.
Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance.
arXiv Detail & Related papers (2024-08-15T17:19:12Z) - Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z) - Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models [24.915387910764082]
Expert-Specialized Fine-Tuning, or ESFT, tunes the experts most relevant to downstream tasks while freezing the other experts and modules.
MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks.
arXiv Detail & Related papers (2024-07-02T03:11:13Z) - A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change of the routers l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z) - Efficient Deweather Mixture-of-Experts with Uncertainty-aware
Feature-wise Linear Modulation [44.43376913419967]
We propose an efficient Mixture-of-Experts (MoE) architecture with weight sharing across experts.
MoFME implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block.
Experiments show that our MoFME outperforms the baselines in the image restoration quality by 0.1-0.2 dB.
arXiv Detail & Related papers (2023-12-27T15:23:37Z) - Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low rank experts.
arXiv Detail & Related papers (2023-12-01T23:04:27Z) - Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
MoE is hard to be deployed on cloud or mobile environment.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.