MoIN: Mixture of Introvert Experts to Upcycle an LLM
- URL: http://arxiv.org/abs/2410.09687v1
- Date: Sun, 13 Oct 2024 01:11:04 GMT
- Title: MoIN: Mixture of Introvert Experts to Upcycle an LLM
- Authors: Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash
- Abstract summary: This paper aims to improve an existing large language model without continued pre-training of the full model.
The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset.
During inference, an incoming query is first routed to the most relevant expert, which is then loaded onto the base model for the forward pass.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert, which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not collaborate with other experts on a single query. Hence, we dub them "introvert" experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. Training of all experts can be done in parallel without any communication channels between them. Similarly, inference can also be heavily parallelized by distributing experts on different GPUs and routing each request to the GPU containing its relevant expert. We implement a proof-of-concept version of this method and show the validity of our approach.
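A minimal sketch of the routing-plus-adapter pipeline described above, assuming the pre-training data has already been split into clusters and one low-rank adapter has been trained per cluster; every name, shape, and the hash-seeded embedder below is an illustrative stand-in, not the authors' implementation:

```python
# Illustrative sketch of single-expert ("introvert") routing; not the paper's code.
import numpy as np

EMBED_DIM = 64
NUM_EXPERTS = 8
RANK = 4

rng = np.random.default_rng(0)

# One centroid per semantically coherent subset of the pre-training data (hypothetical).
centroids = rng.normal(size=(NUM_EXPERTS, EMBED_DIM))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Per-expert low-rank adapter weights, kept off the frozen base model until routed to.
adapters = {
    e: {"A": rng.normal(size=(EMBED_DIM, RANK)) * 0.02,
        "B": rng.normal(size=(RANK, EMBED_DIM)) * 0.02}
    for e in range(NUM_EXPERTS)
}

def embed(query: str) -> np.ndarray:
    """Hypothetical query embedder: hash-seeded noise as a stand-in for a trained encoder."""
    g = np.random.default_rng(abs(hash(query)) % (2 ** 32))
    v = g.normal(size=EMBED_DIM)
    return v / np.linalg.norm(v)

def route(q_emb: np.ndarray) -> int:
    """Pick the single most relevant expert; experts never collaborate on a query."""
    return int(np.argmax(centroids @ q_emb))  # cosine similarity (unit-norm vectors)

def forward(query: str) -> np.ndarray:
    q_emb = embed(query)
    expert_id = route(q_emb)
    adapter = adapters[expert_id]                 # "load" only this expert's adapter
    hidden = q_emb                                # stand-in for the frozen base model's hidden state
    delta = hidden @ adapter["A"] @ adapter["B"]  # low-rank adapter contribution
    return hidden + delta

print(forward("Explain mixture-of-experts routing.").shape)  # (64,)
```

Because each query touches exactly one adapter, the experts can be trained with no inter-expert communication, and at inference time the adapters can be sharded across GPUs with each request routed to the GPU holding its chosen expert, matching the parallelism argument in the abstract.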
Related papers
- Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging [36.0133566024214]
Upcycling Instruction Tuning (UpIT) is a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model.
To ensure each specialized expert in the MoE model works as expected, we select a small amount of seed data at which each expert excels in order to pre-optimize the router.
arXiv Detail & Related papers (2024-10-02T14:48:22Z)
- Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection [63.96018203905272]
We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts.
We demonstrate the effectiveness of our method, DiffPruning, across several datasets.
arXiv Detail & Related papers (2024-09-23T21:27:26Z)
- RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models [58.987116118425995]
We introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts.
It is lightweight and allows easy addition or removal of experts without additional training.
It is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models.
arXiv Detail & Related papers (2024-09-04T13:16:55Z)
- Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts [44.09546603624385]
We introduce a notion of expert specialization for Soft MoE.
We show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset.
arXiv Detail & Related papers (2024-09-02T00:39:00Z)
- Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number of experts, or even just one, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance (a minimal top-k gating sketch appears after this list).
arXiv Detail & Related papers (2024-03-26T05:48:02Z)
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [81.18305296110853]
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains.
Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion.
BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
arXiv Detail & Related papers (2024-03-12T16:54:58Z)
- Learning to Route Among Specialized Experts for Zero-Shot Generalization [39.56470042680907]
We propose Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE)
It learns to route among specialized modules that were produced through parameter-efficient fine-tuning.
It does not require simultaneous access to the datasets used to create the specialized models and only requires a modest amount of additional compute after each expert model is trained.
arXiv Detail & Related papers (2024-02-08T17:43:22Z)
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to largely accelerate the inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution outperforms cutting-edge MoE implementations with 8 to 64 experts, achieving up to a 2.2x improvement in inference throughput.
arXiv Detail & Related papers (2024-01-16T14:16:47Z)
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE achieves up to 80% memory reduction and a 20% FLOPs reduction, with virtually no loss in performance.
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
- On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
arXiv Detail & Related papers (2022-04-20T01:40:19Z)
- Finding Experts in Transformer Models [2.105564340986074]
We study the presence of expert units in pre-trained Transformer Models (TM), and how they impact a model's performance.
We compile a dataset of 1641 concepts that allows diverse expert units in TM to be discovered.
We show how to self-condition off-the-shelf pre-trained language models to generate text with a given concept by forcing the top experts to be active.
arXiv Detail & Related papers (2020-05-15T17:07:02Z)
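Several of the sparse MoE entries above (e.g., the generalization-error and representation-collapse papers) rely on token-level top-k gating, in contrast to MoIN's query-level single-expert routing. The sketch below is a minimal illustration with assumed shapes and hyperparameters; it is not code from any of the listed papers.

```python
# Minimal top-k sparse MoE layer sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
D_MODEL, D_FF, NUM_EXPERTS, TOP_K = 32, 64, 8, 2

# Router and per-expert feed-forward weights (randomly initialized for the demo).
W_router = rng.normal(size=(D_MODEL, NUM_EXPERTS)) * 0.02
W_in = rng.normal(size=(NUM_EXPERTS, D_MODEL, D_FF)) * 0.02
W_out = rng.normal(size=(NUM_EXPERTS, D_FF, D_MODEL)) * 0.02

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """tokens: (num_tokens, D_MODEL). Each token is processed by its TOP_K experts only."""
    logits = tokens @ W_router                        # (T, E) routing scores
    topk = np.argsort(-logits, axis=1)[:, :TOP_K]     # indices of best-matched experts per token
    gates = softmax(np.take_along_axis(logits, topk, axis=1), axis=1)  # renormalized gate weights
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for slot in range(TOP_K):
            e = topk[t, slot]
            h = np.maximum(tokens[t] @ W_in[e], 0.0)  # expert FFN with ReLU
            out[t] += gates[t, slot] * (h @ W_out[e])
    return out

print(moe_layer(rng.normal(size=(4, D_MODEL))).shape)  # (4, 32)
```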