Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for
Large Language Models
- URL: http://arxiv.org/abs/2305.14705v2
- Date: Wed, 5 Jul 2023 17:24:43 GMT
- Title: Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for
Large Language Models
- Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei,
Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu,
Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt
Keutzer, Trevor Darrell, Denny Zhou
- Abstract summary: We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
- Score: 125.91897197446379
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be
utilized to add learnable parameters to Large Language Models (LLMs) without
increasing inference cost. Instruction tuning is a technique for training LLMs
to follow instructions. We advocate combining these two approaches, as we find
that MoE models benefit more from instruction tuning than dense models. In
particular, we conduct empirical studies across three experimental setups: (i)
Direct finetuning on individual downstream tasks devoid of instruction tuning;
(ii) Instruction tuning followed by in-context few-shot or zero-shot
generalization on downstream tasks; and (iii) Instruction tuning supplemented
by further finetuning on individual downstream tasks. In the first scenario,
MoE models overall underperform dense models of identical computational
capacity. This narrative, however, dramatically changes with the introduction
of instruction tuning (second and third scenario), used independently or in
conjunction with task-specific finetuning. Our most powerful model,
FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark
tasks, while using only a third of the FLOPs. The advancements embodied by
FLAN-MOE inspire a reevaluation of the design principles of large-scale,
high-performance language models in the framework of task-agnostic learning.
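To make the architecture concrete, below is a minimal PyTorch sketch of a sparse MoE feed-forward layer with learned top-2 token routing; the class name, dimensions, and expert count are illustrative rather than taken from FLAN-MOE. Because each token is processed by only two experts, adding experts grows the parameter count while per-token inference FLOPs stay roughly constant, which is the property the abstract relies on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative sparse MoE feed-forward layer with top-2 token routing.

    Each token is processed by only k experts, so adding experts grows the
    parameter count while the per-token compute stays roughly constant.
    """

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # learned gating
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        gate_logits = self.router(x)           # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # dispatch tokens to their experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                  # a batch of token embeddings
print(MoELayer()(tokens).shape)                # torch.Size([16, 512])
```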
Related papers
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
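The UNCURL entry above describes pruning experts per MoE layer offline after training. As a rough, hypothetical illustration of that idea (not the paper's actual algorithm), one can record how often a task's tokens route to each expert and keep only the most-utilized ones:

```python
import torch

def prune_experts(router_logits, keep=4):
    """Hypothetical task-aware pruning heuristic (inspired by, not identical
    to, UNCURL): keep the experts a task's tokens route to most often."""
    top1 = router_logits.argmax(dim=-1)       # each token's preferred expert
    usage = torch.bincount(top1, minlength=router_logits.shape[-1]).float()
    return usage.topk(keep).indices           # expert ids to retain offline

logits = torch.randn(10_000, 8)               # stand-in routing stats for one task
print(prune_experts(logits))                  # e.g. tensor([2, 5, 0, 7])
```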
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
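As a hedged illustration of the divergent-chains idea above, the sketch below formats one hypothetical DCoT-style training example in which several reasoning chains precede a single final answer; the template is invented for illustration and is not the paper's exact format.

```python
def make_dcot_example(question, chains, answer):
    """Hypothetical DCoT-style training target: several divergent reasoning
    chains are concatenated so the model learns to generate and compare
    alternatives before committing to an answer."""
    body = "\n".join(f"Chain {i + 1}: {c}" for i, c in enumerate(chains))
    return {"instruction": question, "response": f"{body}\nFinal answer: {answer}"}

ex = make_dcot_example(
    "What is 17 * 6?",
    ["17*6 = 17*5 + 17 = 85 + 17 = 102", "17*6 = 20*6 - 3*6 = 120 - 18 = 102"],
    "102",
)
print(ex["response"])
```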
- Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts [20.202031878825153]
We propose a novel dynamic data mixture for MoE instruction tuning.
Inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets.
Results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries.
arXiv Detail & Related papers (2024-06-17T06:47:03Z)
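The dynamic data mixing entry above builds dataset-level representations from MoE token routing. The following sketch shows one plausible reading of that idea, with an invented heuristic update rule (the paper's actual rule may differ): represent each dataset by its mean router distribution and upweight datasets that sit far from the current mixture.

```python
import torch
import torch.nn.functional as F

def dataset_mixing_weights(routing_probs_per_dataset):
    """Sketch of routing-based dynamic data mixing (heuristic stand-in).
    Each dataset is represented by its average router distribution; datasets
    farther from the mixture mean get larger sampling weights, nudging
    training toward under-served data."""
    reps = torch.stack([p.mean(dim=0) for p in routing_probs_per_dataset])
    center = reps.mean(dim=0, keepdim=True)
    dist = (reps - center).norm(dim=-1)          # dissimilarity to the mixture
    return F.softmax(dist, dim=0)                # normalized sampling weights

# routing probabilities collected from an MoE model for three datasets
probs = [F.softmax(torch.randn(1000, 8), dim=-1) for _ in range(3)]
print(dataset_mixing_weights(probs))             # sums to 1.0
```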
- Mosaic-IT: Free Compositional Data Augmentation Improves Instruction Tuning [30.82220015525281]
Mosaic Instruction Tuning (Mosaic-IT) is a human/model-free compositional data augmentation method.
Mosaic-IT randomly creates rich and diverse augmentations from existing instruction tuning data.
Our evaluations demonstrate superior performance and training efficiency for Mosaic-IT.
arXiv Detail & Related papers (2024-05-22T04:08:20Z)
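Mosaic-IT, summarized above, composes augmentations purely from existing instruction data. A simplified sketch of that kind of compositional augmentation follows; the concatenation template is illustrative, and the paper's additional meta-instructions (e.g., ordering or formatting constraints) are omitted.

```python
import random

def mosaic_augment(pool, k_max=3, rng=random):
    """Sketch of Mosaic-IT-style compositional augmentation: several existing
    examples are concatenated into one, with no human or model annotation."""
    k = rng.randint(2, k_max)
    picks = rng.sample(pool, k)
    instr = "Answer each of the following tasks in order:\n" + "\n".join(
        f"{i + 1}. {p['instruction']}" for i, p in enumerate(picks)
    )
    resp = "\n".join(f"{i + 1}. {p['response']}" for i, p in enumerate(picks))
    return {"instruction": instr, "response": resp}

pool = [
    {"instruction": "Translate 'bonjour' to English.", "response": "hello"},
    {"instruction": "What is 2 + 2?", "response": "4"},
    {"instruction": "Name the capital of France.", "response": "Paris"},
]
print(mosaic_augment(pool)["instruction"])
```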
- ILLUMINER: Instruction-tuned Large Language Models as Few-shot Intent Classifier and Slot Filler [1.9015367254988451]
This study evaluates instruction-tuned models (Instruct-LLMs) on popular benchmark datasets for intent classification (IC) and slot filling (SF).
We introduce ILLUMINER, an approach framing IC and SF as language generation tasks for Instruct-LLMs, with a more efficient SF-prompting method compared to prior work.
A comprehensive comparison with multiple baselines shows that our approach, using the FLAN-T5 11B model, outperforms the state-of-the-art joint IC+SF method and in-context learning with GPT-3.5 (175B).
arXiv Detail & Related papers (2024-03-26T09:41:21Z)
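ILLUMINER, per the entry above, casts intent classification and slot filling as text generation. A hypothetical prompt in that spirit is sketched below; the wording and answer format are assumptions, not the paper's exact template.

```python
def ic_sf_prompt(utterance, intents, slots):
    """Hypothetical prompt in the spirit of ILLUMINER: intent classification
    and slot filling are posed as text generation for an instruction-tuned
    model (the paper's exact template may differ)."""
    return (
        "Given the user utterance, choose one intent from "
        f"[{', '.join(intents)}] and extract values for the slots "
        f"[{', '.join(slots)}]. Answer as 'intent; slot=value, ...'.\n"
        f"Utterance: {utterance}\nAnswer:"
    )

print(ic_sf_prompt(
    "Book a table for two in Berlin tonight",
    intents=["book_restaurant", "play_music"],
    slots=["party_size", "city", "time"],
))
# An instruction-tuned model would ideally generate:
# book_restaurant; party_size=two, city=Berlin, time=tonight
```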
- Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [5.536630285985836]
We introduce parameter-efficient sparsity crafting (PESC).
PESC crafts dense models into sparse models using the mixture-of-experts (MoE) architecture.
Our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GPT-3.5.
arXiv Detail & Related papers (2024-01-05T09:58:09Z)
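PESC, summarized in the entry above, crafts a dense model into a sparse MoE model parameter-efficiently. The sketch below shows one simplified interpretation: experts share the frozen dense FFN and differ only by small trainable adapters, so only the adapters and the router are learned. Class names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class PESCLayer(nn.Module):
    """Simplified sketch of the PESC idea: a dense FFN becomes an MoE layer
    whose experts share the frozen dense weights and differ only by small
    trainable adapters, so very few new parameters are learned."""

    def __init__(self, dense_ffn, d_model=512, n_experts=4, r=16):
        super().__init__()
        self.shared = dense_ffn                      # frozen dense FFN
        for p in self.shared.parameters():
            p.requires_grad = False
        self.router = nn.Linear(d_model, n_experts)  # trainable gate
        self.adapters = nn.ModuleList(               # one small adapter per expert
            nn.Sequential(nn.Linear(d_model, r), nn.GELU(), nn.Linear(r, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)
        idx = gate.argmax(dim=-1)                    # top-1 routing for brevity
        base = self.shared(x)
        out = base.clone()
        for e, adapter in enumerate(self.adapters):
            mask = idx == e
            if mask.any():
                # scaling by the gate value keeps the router trainable
                out[mask] = base[mask] + gate[mask, e, None] * adapter(x[mask])
        return out

dense = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
print(PESCLayer(dense)(torch.randn(8, 512)).shape)   # torch.Size([8, 512])
```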
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
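MMLoRA, per the entry above, applies low-rank adaptation in a multi-modal setting. The building block it rests on is standard LoRA, sketched generically below; this is not the paper's code, and MMLoRA's multi-modal wiring is omitted.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adaptation of a frozen linear layer. MMLoRA applies
    this kind of adapter jointly across uni-modal and multi-modal components;
    this standalone layer is a generic LoRA sketch."""

    def __init__(self, base: nn.Linear, r=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():             # freeze pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start as no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)              # torch.Size([4, 768])
```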
- Data-Efficiency with a Single GPU: An Exploration of Transfer Methods for Small Language Models [5.539060030062833]
Multi-task learning, instruction tuning, and prompting have been shown to improve the generalizability of large language models to new tasks.
This work explores and isolates the effects of (i) model size, (ii) general-purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) few-shot fine-tuning for models with fewer than 500 million parameters.
arXiv Detail & Related papers (2022-10-08T01:45:22Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
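The UL2 entry above notes that different pre-training objectives can be cast as one another and interpolated. One concrete reading is a mixture of denoisers that differ only in span length and corruption rate; the configuration values below are illustrative, not the paper's.

```python
import random

# Sketch of a UL2-style "mixture of denoisers": pre-training examples are
# drawn from several corruption objectives that differ only in span length
# and corruption rate, which is one way objectives interpolate into each
# other. The values here are illustrative, not the paper's.
DENOISERS = [
    {"name": "R", "mean_span": 3, "corrupt_rate": 0.15},     # regular spans
    {"name": "X", "mean_span": 32, "corrupt_rate": 0.50},    # extreme corruption
    {"name": "S", "mean_span": None, "corrupt_rate": 0.25},  # prefix-LM style
]

def sample_objective(rng=random):
    cfg = rng.choice(DENOISERS)
    # A paradigm token like "[R]" would be prepended to the input so the
    # model knows which denoising mode it is being asked to perform.
    return f"[{cfg['name']}]", cfg

token, cfg = sample_objective()
print(token, cfg)
```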
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.