BTS: Harmonizing Specialized Experts into a Generalist LLM
- URL: http://arxiv.org/abs/2502.00075v1
- Date: Fri, 31 Jan 2025 07:54:34 GMT
- Title: BTS: Harmonizing Specialized Experts into a Generalist LLM
- Authors: Qizhen Zhang, Prajjwal Bhargava, Chloe Bi, Chris X. Cai, Jakob Foerster, Jeremy Fu, Punit Singh Koura, Ruan Silva, Sheng Shen, Emily Dinan, Suchin Gururangan, Mike Lewis
- Abstract summary: Branch-Train-Stitch (BTS) is an efficient training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model.
Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks.
- Score: 52.026293450944635
- Abstract: We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.
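The abstract describes stitch layers only at a high level, so the following is a minimal sketch of the general mechanism, assuming a learned projection plus cross-attention from the seed model's hidden states to the frozen experts' hidden states. The layer names, dimensions, gating, and aggregation are illustrative assumptions, not the paper's implementation.
```python
import torch
import torch.nn as nn

class StitchLayer(nn.Module):
    """Illustrative stitch layer: lets the frozen seed LLM attend to hidden
    states produced by frozen domain experts. Sketch of the general idea only;
    the actual BTS layer design may differ."""

    def __init__(self, seed_dim: int, expert_dim: int, n_heads: int = 8):
        super().__init__()
        # Project expert representations into the seed model's hidden size.
        self.expert_proj = nn.Linear(expert_dim, seed_dim)
        # Cross-attention: seed hidden states query the expert representations.
        # (seed_dim is assumed divisible by n_heads.)
        self.cross_attn = nn.MultiheadAttention(seed_dim, n_heads, batch_first=True)
        # Zero-initialized gate keeps the seed model's computation intact at the start.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, seed_hidden, expert_hiddens):
        # seed_hidden: [batch, seq, seed_dim]
        # expert_hiddens: list of [batch, seq, expert_dim] tensors, one per expert.
        experts = torch.cat([self.expert_proj(h) for h in expert_hiddens], dim=1)
        attended, _ = self.cross_attn(seed_hidden, experts, experts)
        return seed_hidden + torch.tanh(self.gate) * attended
```
Under this reading, only the stitch layers receive gradients during the small-datamix training stage, while the seed LLM and all experts stay frozen, which is what makes adding or removing an expert cheap.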
Related papers
- LFME: A Simple Framework for Learning from Multiple Experts in Domain Generalization [61.16890890570814]
Domain generalization (DG) methods aim to maintain good performance in an unseen target domain by using training data from multiple source domains.
This work introduces a simple yet effective framework, dubbed learning from multiple experts (LFME), that aims to make the target model an expert in all source domains to improve DG.
arXiv Detail & Related papers (2024-10-22T13:44:10Z)
- MoIN: Mixture of Introvert Experts to Upcycle an LLM [15.182215869841789]
This paper aims to improve an existing large language model without continued pre-training of the full model.
The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset.
During inference, an incoming query is first routed to the most relevant expert which is then loaded onto the base model for the forward pass.
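As a rough illustration of that routing step, here is a hedged sketch that picks the expert whose training-cluster centroid is most similar to the query embedding and then attaches that expert for the forward pass; the embedding model, centroid-based routing rule, and adapter-loading call are assumptions, not details from the paper.
```python
import numpy as np

def route_query(query_embedding: np.ndarray, centroids: np.ndarray) -> int:
    """Return the index of the expert whose cluster centroid has the highest
    cosine similarity to the query embedding. Illustrative routing rule only."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Hypothetical usage: one centroid per semantic group the pre-training data
# was split into, one lightweight expert trained per group.
# expert_id = route_query(embed(query), centroids)
# model.load_adapter(expert_paths[expert_id])  # then run the forward pass
```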
arXiv Detail & Related papers (2024-10-13T01:11:04Z)
- An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing [55.25224913110965]
Expert-Token-Routing represents expert LLMs as special expert tokens within the vocabulary of a meta LLM.
It supports learning the implicit expertise of expert LLMs from existing instruction datasets.
It also conceals the detailed collaboration process from the user's perspective, facilitating interaction as though it were a singular LLM.
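One way to picture this, as a hedged sketch: the meta LLM's vocabulary is extended with one special token per expert, and emitting that token hands the query to the corresponding expert LLM. The token names, the one-token probe, and the `.generate()` interface below are illustrative assumptions, not the paper's implementation.
```python
# Hypothetical sketch of expert-token delegation by a meta LLM.
# `meta_llm` and each expert are assumed to expose a simple
# .generate(prompt, ...) -> str interface; the token names are made up.
EXPERT_TOKENS = {"<expert_code>": "code_llm", "<expert_math>": "math_llm"}

def generate_with_experts(meta_llm, experts: dict, prompt: str) -> str:
    """If the meta LLM's first generated token is an expert token, delegate
    the prompt to that expert LLM; otherwise let the meta LLM answer."""
    first_token = meta_llm.generate(prompt, max_new_tokens=1).strip()
    if first_token in EXPERT_TOKENS:
        return experts[EXPERT_TOKENS[first_token]].generate(prompt)
    return meta_llm.generate(prompt)
```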
arXiv Detail & Related papers (2024-03-25T15:17:05Z)
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [81.18305296110853]
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains.
Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion.
BTX generalizes two special cases: the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
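A hedged sketch of the mixing step: the feedforward sublayers of the separately trained experts could be collected into a single sparsely activated MoE layer whose router is learned during finetuning. The dimensions, top-k routing, and dense dispatch loop below are assumptions for illustration; non-FFN parameters would be handled separately (e.g., by averaging).
```python
import torch
import torch.nn as nn

class MoEFromExperts(nn.Module):
    """Combine FFN sublayers copied from separately trained experts into one
    MoE layer with a learned router. Sketch only; the exact BTX recipe may differ."""

    def __init__(self, expert_ffns: list[nn.Module], hidden_dim: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)  # copied from the branched models
        self.router = nn.Linear(hidden_dim, len(expert_ffns))
        self.top_k = top_k

    def forward(self, x):  # x: [batch, seq, hidden_dim]
        scores = self.router(x)                          # [B, S, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)   # per-token top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Dense loop for clarity; real MoE kernels dispatch tokens sparsely.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens routed to expert e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```
In this reading, the branched experts supply the MoE weights "for free", and only a short finetuning stage is needed to teach the router which tokens go where.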
arXiv Detail & Related papers (2024-03-12T16:54:58Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts.
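As a hedged illustration of softly mixing low-rank experts: every expert contributes to every token, weighted by a softmax over router scores, on top of a frozen base projection. The LoRA-style parameterization and per-token weighting below are assumptions about the general idea, not the paper's exact architecture.
```python
import torch
import torch.nn as nn

class SoftLowRankMixture(nn.Module):
    """Soft mixture of low-rank (LoRA-style) experts added on top of a frozen
    base linear layer. Illustrative sketch only."""

    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained projection
        d_in, d_out = base.in_features, base.out_features
        self.down = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):  # x: [batch, seq, d_in]
        w = self.router(x).softmax(dim=-1)  # soft weights over all experts
        low_rank = torch.einsum("bsd,edr,ero->bseo", x, self.down, self.up)
        return self.base(x) + torch.einsum("bse,bseo->bso", w, low_rank)
```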
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE achieves up to an 80% reduction in memory and a 20% reduction in FLOPs, with virtually no loss in performance.
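A hedged sketch of the merging idea: weight each expert by how often the router selected it, then collapse a group of same-shaped experts into one averaged layer. The grouping, the usage-count statistics, and the plain weighted average below are illustrative assumptions; the actual M-SMoE/MC-SMoE procedure also involves expert grouping and low-rank compression.
```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_experts(experts: list[nn.Linear], usage_counts: list[int]) -> nn.Linear:
    """Collapse a group of same-shaped expert layers into one layer whose
    weights are a routing-frequency-weighted average. Sketch only; assumes
    each expert layer has a bias term."""
    freqs = torch.tensor(usage_counts, dtype=torch.float32)
    freqs = freqs / freqs.sum()
    merged = nn.Linear(experts[0].in_features, experts[0].out_features)
    merged.weight.copy_(sum(f * e.weight for f, e in zip(freqs, experts)))
    merged.bias.copy_(sum(f * e.bias for f, e in zip(freqs, experts)))
    return merged
```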
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.