BTS: Harmonizing Specialized Experts into a Generalist LLM
- URL: http://arxiv.org/abs/2502.00075v1
- Date: Fri, 31 Jan 2025 07:54:34 GMT
- Title: BTS: Harmonizing Specialized Experts into a Generalist LLM
- Authors: Qizhen Zhang, Prajjwal Bhargava, Chloe Bi, Chris X. Cai, Jakob Foerster, Jeremy Fu, Punit Singh Koura, Ruan Silva, Sheng Shen, Emily Dinan, Suchin Gururangan, Mike Lewis
- Abstract summary: Branch-Train-Stitch (BTS) is an efficient training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model.
Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks.
- Score: 52.026293450944635
- Abstract: We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.
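The abstract describes stitch layers only at a high level, so the following is a minimal sketch of the general mechanism, assuming a learned projection plus cross-attention from the seed model's hidden states to the frozen experts' hidden states. The layer names, dimensions, gating, and aggregation are illustrative assumptions, not the paper's implementation.
```python
import torch
import torch.nn as nn

class StitchLayer(nn.Module):
    """Illustrative stitch layer: lets the frozen seed LLM attend to hidden
    states produced by frozen domain experts. Sketch of the general idea only;
    the actual BTS layer design may differ."""

    def __init__(self, seed_dim: int, expert_dim: int, n_heads: int = 8):
        super().__init__()
        # Project expert representations into the seed model's hidden size.
        self.expert_proj = nn.Linear(expert_dim, seed_dim)
        # Cross-attention: seed hidden states query the expert representations.
        # (seed_dim is assumed divisible by n_heads.)
        self.cross_attn = nn.MultiheadAttention(seed_dim, n_heads, batch_first=True)
        # Zero-initialized gate keeps the seed model's computation intact at the start.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, seed_hidden, expert_hiddens):
        # seed_hidden: [batch, seq, seed_dim]
        # expert_hiddens: list of [batch, seq, expert_dim] tensors, one per expert.
        experts = torch.cat([self.expert_proj(h) for h in expert_hiddens], dim=1)
        attended, _ = self.cross_attn(seed_hidden, experts, experts)
        return seed_hidden + torch.tanh(self.gate) * attended
```
Under this reading, only the stitch layers receive gradients during the small-datamix training stage, while the seed LLM and all experts stay frozen, which is what makes adding or removing an expert cheap.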
Related papers
- LFME: A Simple Framework for Learning from Multiple Experts in Domain Generalization [61.16890890570814]
Domain generalization (DG) methods aim to maintain good performance in an unseen target domain by using training data from multiple source domains.
This work introduces a simple yet effective framework, dubbed learning from multiple experts (LFME), that aims to make the target model an expert in all source domains to improve DG.
arXiv Detail & Related papers (2024-10-22T13:44:10Z)
- MoIN: Mixture of Introvert Experts to Upcycle an LLM [15.182215869841789]
This paper aims to improve an existing large language model without continued pre-training of the full model.
The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset.
During inference, an incoming query is first routed to the most relevant expert which is then loaded onto the base model for the forward pass.
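As a rough illustration of that routing step, here is a hedged sketch that picks the expert whose training-cluster centroid is most similar to the query embedding and then attaches that expert for the forward pass; the embedding model, centroid-based routing rule, and adapter-loading call are assumptions, not details from the paper.
```python
import numpy as np

def route_query(query_embedding: np.ndarray, centroids: np.ndarray) -> int:
    """Return the index of the expert whose cluster centroid has the highest
    cosine similarity to the query embedding. Illustrative routing rule only."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Hypothetical usage: one centroid per semantic group the pre-training data
# was split into, one lightweight expert trained per group.
# expert_id = route_query(embed(query), centroids)
# model.load_adapter(expert_paths[expert_id])  # then run the forward pass
```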
arXiv Detail & Related papers (2024-10-13T01:11:04Z)
- An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing [55.25224913110965]
Expert-Token-Routing represents expert LLMs as special expert tokens within the vocabulary of a meta LLM.
It supports learning the implicit expertise of expert LLMs from existing instruction datasets.
It also conceals the detailed collaboration process from the user's perspective, facilitating interaction as though it were a singular LLM.
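One way to picture this, as a hedged sketch: the meta LLM's vocabulary is extended with one special token per expert, and emitting that token hands the query to the corresponding expert LLM. The token names, the one-token probe, and the `.generate()` interface below are illustrative assumptions, not the paper's implementation.
```python
# Hypothetical sketch of expert-token delegation by a meta LLM.
# `meta_llm` and each expert are assumed to expose a simple
# .generate(prompt, ...) -> str interface; the token names are made up.
EXPERT_TOKENS = {"<expert_code>": "code_llm", "<expert_math>": "math_llm"}

def generate_with_experts(meta_llm, experts: dict, prompt: str) -> str:
    """If the meta LLM's first generated token is an expert token, delegate
    the prompt to that expert LLM; otherwise let the meta LLM answer."""
    first_token = meta_llm.generate(prompt, max_new_tokens=1).strip()
    if first_token in EXPERT_TOKENS:
        return experts[EXPERT_TOKENS[first_token]].generate(prompt)
    return meta_llm.generate(prompt)
```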
arXiv Detail & Related papers (2024-03-25T15:17:05Z)
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [81.18305296110853]
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains.
Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion.
BTX generalizes two special cases: the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
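A hedged sketch of the mixing step: the feedforward sublayers of the separately trained experts could be collected into a single sparsely activated MoE layer whose router is learned during finetuning. The dimensions, top-k routing, and dense dispatch loop below are assumptions for illustration; non-FFN parameters would be handled separately (e.g., by averaging).
```python
import torch
import torch.nn as nn

class MoEFromExperts(nn.Module):
    """Combine FFN sublayers copied from separately trained experts into one
    MoE layer with a learned router. Sketch only; the exact BTX recipe may differ."""

    def __init__(self, expert_ffns: list[nn.Module], hidden_dim: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)  # copied from the branched models
        self.router = nn.Linear(hidden_dim, len(expert_ffns))
        self.top_k = top_k

    def forward(self, x):  # x: [batch, seq, hidden_dim]
        scores = self.router(x)                          # [B, S, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)   # per-token top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Dense loop for clarity; real MoE kernels dispatch tokens sparsely.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens routed to expert e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```
In this reading, the branched experts supply the MoE weights "for free", and only a short finetuning stage is needed to teach the router which tokens go where.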
arXiv Detail & Related papers (2024-03-12T16:54:58Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts.
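As a hedged illustration of softly mixing low-rank experts: every expert contributes to every token, weighted by a softmax over router scores, on top of a frozen base projection. The LoRA-style parameterization and per-token weighting below are assumptions about the general idea, not the paper's exact architecture.
```python
import torch
import torch.nn as nn

class SoftLowRankMixture(nn.Module):
    """Soft mixture of low-rank (LoRA-style) experts added on top of a frozen
    base linear layer. Illustrative sketch only."""

    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained projection
        d_in, d_out = base.in_features, base.out_features
        self.down = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):  # x: [batch, seq, d_in]
        w = self.router(x).softmax(dim=-1)  # soft weights over all experts
        low_rank = torch.einsum("bsd,edr,ero->bseo", x, self.down, self.up)
        return self.base(x) + torch.einsum("bse,bseo->bso", w, low_rank)
```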
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE achieves up to an 80% reduction in memory and a 20% reduction in FLOPs, with virtually no loss in performance.
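A hedged sketch of the merging idea: weight each expert by how often the router selected it, then collapse a group of same-shaped experts into one averaged layer. The grouping, the usage-count statistics, and the plain weighted average below are illustrative assumptions; the actual M-SMoE/MC-SMoE procedure also involves expert grouping and low-rank compression.
```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_experts(experts: list[nn.Linear], usage_counts: list[int]) -> nn.Linear:
    """Collapse a group of same-shaped expert layers into one layer whose
    weights are a routing-frequency-weighted average. Sketch only; assumes
    each expert layer has a bias term."""
    freqs = torch.tensor(usage_counts, dtype=torch.float32)
    freqs = freqs / freqs.sum()
    merged = nn.Linear(experts[0].in_features, experts[0].out_features)
    merged.weight.copy_(sum(f * e.weight for f, e in zip(freqs, experts)))
    merged.bias.copy_(sum(f * e.bias for f, e in zip(freqs, experts)))
    return merged
```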
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.