Learning to Route Among Specialized Experts for Zero-Shot Generalization
- URL: http://arxiv.org/abs/2402.05859v2
- Date: Thu, 20 Jun 2024 20:31:51 GMT
- Title: Learning to Route Among Specialized Experts for Zero-Shot Generalization
- Authors: Mohammed Muqeeth, Haokun Liu, Yufan Liu, Colin Raffel
- Abstract summary: We propose Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), which learns to route among specialized modules produced through parameter-efficient fine-tuning.
It does not require simultaneous access to the datasets used to create the specialized models and only requires a modest amount of additional compute after each expert model is trained.
- Score: 39.56470042680907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been a widespread proliferation of "expert" language models that are specialized to a specific task or domain through parameter-efficient fine-tuning. How can we recycle large collections of expert language models to improve zero-shot generalization to unseen tasks? In this work, we propose Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), which learns to route among specialized modules that were produced through parameter-efficient fine-tuning. Unlike past methods that learn to route among specialized models, PHATGOOSE explores the possibility that zero-shot generalization will be improved if different experts can be adaptively chosen for each token and at each layer in the model. Crucially, our method is post-hoc - it does not require simultaneous access to the datasets used to create the specialized models and only requires a modest amount of additional compute after each expert model is trained. In experiments covering a range of specialized model collections and zero-shot generalization benchmarks, we find that PHATGOOSE outperforms past methods for post-hoc routing and, in some cases, outperforms explicit multitask training (which requires simultaneous data access). To better understand the routing strategy learned by PHATGOOSE, we perform qualitative experiments to validate that PHATGOOSE's performance stems from its ability to make adaptive per-token and per-module expert choices. We release all of our code to support future work on improving zero-shot generalization by recycling specialized experts.
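To make the described routing concrete, the following is a minimal, illustrative PyTorch sketch of per-token, per-module gating over a bank of LoRA experts attached to a frozen linear layer. The class name, the use of LoRA as the expert format, per-expert gate vectors, and top-k softmax gating are assumptions made for illustration; they are not taken from the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutedLoRALinear(nn.Module):
    """A frozen linear layer augmented with a bank of LoRA experts and a
    per-token router, loosely following the per-token, per-module routing
    idea described in the abstract. Expert format and gating details are
    illustrative assumptions, not the paper's implementation."""

    def __init__(self, base: nn.Linear, lora_As, lora_Bs, gate_vectors, top_k=2):
        super().__init__()
        self.base = base.requires_grad_(False)  # shared pre-trained weights stay frozen
        # One (A, B) pair per specialized expert, e.g. produced by LoRA fine-tuning.
        self.lora_As = nn.ParameterList(nn.Parameter(A.detach()) for A in lora_As)  # (r, d_in) each
        self.lora_Bs = nn.ParameterList(nn.Parameter(B.detach()) for B in lora_Bs)  # (d_out, r) each
        # One gate vector per expert; a token's routing score is its dot product with the gate.
        self.gates = nn.Parameter(torch.stack([g.detach() for g in gate_vectors]))  # (n_experts, d_in)
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, seq, d_in)
        out = self.base(x)
        scores = x @ self.gates.t()                # per-token affinity to each expert
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        for k in range(self.top_k):
            for e in range(len(self.lora_As)):
                routed = (idx[..., k] == e).unsqueeze(-1).float()      # tokens sent to expert e
                delta = (x @ self.lora_As[e].t()) @ self.lora_Bs[e].t()
                out = out + routed * weights[..., k:k + 1] * delta
        return out


# Example: wrap one layer with three rank-8 experts (random weights for illustration).
base = nn.Linear(16, 16)
layer = RoutedLoRALinear(
    base,
    lora_As=[torch.randn(8, 16) * 0.01 for _ in range(3)],
    lora_Bs=[torch.zeros(16, 8) for _ in range(3)],
    gate_vectors=[torch.randn(16) for _ in range(3)],
)
y = layer(torch.randn(2, 5, 16))                   # -> shape (2, 5, 16)
```

In the setting the abstract describes, each expert (plus a modest amount of additional post-hoc compute) is produced independently on its own dataset and only combined afterwards; the sketch above covers inference-time routing only.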
Related papers
- LFME: A Simple Framework for Learning from Multiple Experts in Domain Generalization [61.16890890570814]
Domain generalization (DG) methods aim to maintain good performance in an unseen target domain by using training data from multiple source domains.
This work introduces a simple yet effective framework, dubbed learning from multiple experts (LFME), that aims to make the target model an expert in all source domains to improve DG.
arXiv Detail & Related papers (2024-10-22T13:44:10Z)
- Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging [36.0133566024214]
Upcycling Instruction Tuning (UpIT) is a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model.
To ensure each specialized expert in the MoE model works as expected, we select a small amount of seed data on which each expert excels and use it to pre-optimize the router.
arXiv Detail & Related papers (2024-10-02T14:48:22Z)
- RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models [58.987116118425995]
We introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts.
It is lightweight and allows easy addition or removal of experts without additional training.
It is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models.
arXiv Detail & Related papers (2024-09-04T13:16:55Z)
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce UNCURL, an adaptive task-aware pruning technique that reduces the number of experts per MoE layer offline, after training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models [36.172093066234794]
We introduce a few human-annotated samples (i.e., K-shot) to advance the task expertise of large language models with open knowledge.
A mixture-of-experts (MoE) system is built to make the best use of the individual yet complementary knowledge of multiple experts.
arXiv Detail & Related papers (2024-08-28T16:28:07Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Divide and not forget: Ensemble of selectively trained experts in Continual Learning [0.2886273197127056]
Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know.
A trend in this area is to use a mixture-of-experts technique, where different models work together to solve the task.
SEED selects only one expert, the one best suited to the task at hand, and uses data from that task to fine-tune only that expert.
arXiv Detail & Related papers (2024-01-18T18:25:29Z)
- Specialist or Generalist? Instruction Tuning for Specific NLP Tasks [58.422495509760154]
We investigate whether incorporating broad-coverage generalist instruction tuning can contribute to building a specialist model.
Our experiments assess four target tasks with distinct coverage levels.
The benefit from incorporating generalist instruction tuning is particularly pronounced when the amount of task-specific training data is limited.
arXiv Detail & Related papers (2023-10-23T19:46:48Z)
- Self-Specialization: Uncovering Latent Expertise within Large Language Models [39.04128008742973]
Recent works have demonstrated the effectiveness of self-alignment in which a large language model is aligned to follow general instructions.
We focus on self-alignment for expert domain specialization.
We show that our self-specialized models outperform their base models by a large margin.
arXiv Detail & Related papers (2023-09-29T21:53:46Z)
- Diversified Dynamic Routing for Vision Tasks [36.199659460868496]
We propose a novel architecture where each layer is composed of a set of experts.
In our method, the model is explicitly trained to find a relevant partitioning of the data.
We conduct experiments on semantic segmentation on Cityscapes and on object detection and instance segmentation on MS-COCO; a rough sketch of the per-layer expert design is given after this entry.
arXiv Detail & Related papers (2022-09-26T23:27:51Z)
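The per-layer expert design in the Diversified Dynamic Routing entry above can be sketched in a few lines. The convolutional experts, the pooling-based router, and the soft weighting below are generic illustrative choices, not the architecture from that paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicRoutedLayer(nn.Module):
    """One layer built from a set of expert sub-layers plus a learned router
    that weights each expert per input; a generic sketch, not the paper's
    exact architecture."""

    def __init__(self, channels: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(n_experts)
        )
        # Router: global average pool -> linear -> softmax over experts.
        self.router = nn.Linear(channels, n_experts)

    def forward(self, x):                                 # x: (batch, C, H, W)
        pooled = x.mean(dim=(2, 3))                       # (batch, C)
        weights = F.softmax(self.router(pooled), dim=-1)  # (batch, n_experts)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            out = out + weights[:, i].view(-1, 1, 1, 1) * expert(x)
        return out
```

A hard (top-1) router would instead send each input to a single expert, trading some capacity for lower compute per example.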
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.