Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- URL: http://arxiv.org/abs/2403.07816v1
- Date: Tue, 12 Mar 2024 16:54:58 GMT
- Title: Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- Authors: Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria
Lin, Baptiste Rozi\`ere, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston,
Xian Li
- Abstract summary: We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains.
Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion.
BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
- Score: 81.18305296110853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate efficient methods for training Large Language Models (LLMs) to
possess capabilities in multiple specialized domains, such as coding, math
reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts
from a seed model, which is branched to train experts in embarrassingly
parallel fashion with high throughput and reduced communication cost. After
individual experts are asynchronously trained, BTX brings together their
feedforward parameters as experts in Mixture-of-Expert (MoE) layers and
averages the remaining parameters, followed by an MoE-finetuning stage to learn
token-level routing. BTX generalizes two special cases, the Branch-Train-Merge
method, which does not have the MoE finetuning stage to learn routing, and
sparse upcycling, which omits the stage of training experts asynchronously.
Compared to alternative approaches, BTX achieves the best accuracy-efficiency
tradeoff.
Related papers
- Ada-K Routing: Boosting the Efficiency of MoE-based LLMs [6.954735360168147]
We propose a novel Ada-K routing strategy that dynamically adjusts the number of activated experts for each token.
Our strategy incorporates learnable and lightweight allocator modules that decide customized expert resource allocation tailored to the contextual needs for each token.
arXiv Detail & Related papers (2024-10-14T12:50:04Z) - Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging [36.0133566024214]
Upcycling Instruction Tuning (UpIT) is a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model.
To ensure each specialized expert in the MoE model works as expected, we select a small amount of seed data that each expert excels to pre-optimize the router.
arXiv Detail & Related papers (2024-10-02T14:48:22Z) - BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts [41.83123857437985]
Training MoEs from scratch in a large-scale regime is prohibitively expensive.
We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming.
Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance.
arXiv Detail & Related papers (2024-08-15T17:19:12Z) - Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE)
arXiv Detail & Related papers (2024-08-13T10:25:13Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - Exploiting Inter-Layer Expert Affinity for Accelerating
Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to largely accelerate the inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution beats the cutting-edge MoE implementations with experts from 8 to 64, with up to 2.2x improvement in inference throughput.
arXiv Detail & Related papers (2024-01-16T14:16:47Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models, that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language
Models [106.65127123304842]
Branch-Train-Merge (BTM) is an efficient algorithm for parallel training of large language models (LLMs)
BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain.
Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs.
arXiv Detail & Related papers (2022-08-05T17:46:38Z) - StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z) - Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts)
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.