Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
- URL: http://arxiv.org/abs/2208.03306v1
- Date: Fri, 5 Aug 2022 17:46:38 GMT
- Title: Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
- Authors: Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, Luke Zettlemoyer
- Abstract summary: Branch-Train-Merge (BTM) is an efficient algorithm for parallel training of large language models (LLMs).
BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain.
Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs.
- Score: 106.65127123304842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Branch-Train-Merge (BTM), a communication-efficient algorithm for
embarrassingly parallel training of large language models (LLMs). We show it is
possible to independently train subparts of a new class of LLMs on different
subsets of the data, eliminating the massive multi-node synchronization
currently required to train LLMs. BTM learns a set of independent expert LMs
(ELMs), each specialized to a different textual domain, such as scientific or
legal text. These ELMs can be added and removed to update data coverage,
ensembled to generalize to new domains, or averaged to collapse back to a
single LM for efficient inference. New ELMs are learned by branching from
(mixtures of) ELMs in the current set, further training the parameters on data
for the new domain, and then merging the resulting model back into the set for
future use. Experiments show that BTM improves in- and out-of-domain
perplexities as compared to GPT-style Transformer LMs, when controlling for
training cost. Through extensive analysis, we show that these results are
robust to different ELM initialization schemes, but require expert domain
specialization; LM ensembles with random data splits do not perform well. We
also present a study of scaling BTM into a new corpus of 64 domains (192B
whitespace-separated tokens in total); the resulting LM (22.4B total
parameters) performs as well as a Transformer LM trained with 2.5 times more
compute. These gains grow with the number of domains, suggesting more
aggressive parallelism could be used to efficiently train larger models in
future work.
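The abstract describes two concrete inference modes for the expert set, plus the branch step that initializes new ELMs from (mixtures of) existing ones. Below is a minimal sketch of both modes in PyTorch; the function names, the uniform-weight default, and the treatment of `domain_weights` as precomputed per-expert scores are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of using a set of expert LMs (ELMs), assuming each expert is a
# standard PyTorch module. All names here are hypothetical, not the paper's API.
import torch
import torch.nn.functional as F

def average_experts(expert_state_dicts, weights=None):
    """Collapse the expert set into a single LM by (weighted) parameter averaging.
    The same averaging could initialize a new ELM branched from a mixture of
    existing experts, before it is further trained on the new domain."""
    n = len(expert_state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    return {
        name: sum(w * sd[name].float() for w, sd in zip(weights, expert_state_dicts))
        for name in expert_state_dicts[0]
    }

def ensemble_next_token(expert_logits, domain_weights):
    """Ensemble the experts at inference: mix each expert's next-token
    distribution with a per-expert domain weight.
    expert_logits: (num_experts, vocab_size); domain_weights: (num_experts,)."""
    probs = F.softmax(expert_logits, dim=-1)              # per-expert distributions
    mixed = (domain_weights.unsqueeze(-1) * probs).sum(dim=0)
    return torch.log(mixed)                               # ensemble log-probabilities
```

Averaging yields a single set of weights and therefore single-model inference cost; ensembling keeps every expert and can generalize to mixed or unseen domains, at the cost of running each expert per token.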
Related papers
- Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm derived from two unique perspectives: human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
- Enhancing Discriminative Tasks by Guiding the Pre-trained Language Model with Large Language Model's Experience [4.814313782484443]
Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks.
We use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks.
arXiv Detail & Related papers (2024-08-16T06:37:59Z)
- Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from fine-tuned (FT) to pre-trained (PT) LLMs.
We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope.
We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z)
- SoupLM: Model Integration in Large Language and Multi-Modal Models [51.12227693121004]
Training large language models (LLMs) requires significant computing resources.
Existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.
arXiv Detail & Related papers (2024-07-11T05:38:15Z)
- CALRec: Contrastive Alignment of Generative LLMs for Sequential Recommendation [18.986613405565514]
Large Language Models (LLMs) pretrained on vast corpora of text can be adapted for sequential recommendation.
We propose a two-stage LLM finetuning framework that finetunes a pretrained LLM in a two-tower fashion using a mixture of two contrastive losses and a language modeling loss.
Our model significantly outperforms many state-of-the-art baselines.
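The summary names the ingredients of the objective (two contrastive losses mixed with a language modeling loss in a two-tower setup), so here is a hedged sketch of how such a mixture could be combined; the InfoNCE form, the symmetric pairing, and the mixing weights are illustrative assumptions, not the paper's exact losses.

```python
# Hypothetical mixture of two contrastive losses and an LM loss; all loss
# forms and weights are assumptions, not taken from the CALRec paper.
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """Standard InfoNCE: the i-th query should match the i-th key."""
    logits = queries @ keys.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(queries.shape[0])
    return F.cross_entropy(logits, targets)

def two_tower_loss(user_emb, item_emb, lm_logits, lm_targets, w_lm=1.0):
    loss_u2i = info_nce(user_emb, item_emb)        # user tower -> item tower
    loss_i2u = info_nce(item_emb, user_emb)        # item tower -> user tower
    # lm_logits: (B, T, V); cross_entropy expects the class dim second.
    loss_lm = F.cross_entropy(lm_logits.transpose(1, 2), lm_targets)
    return loss_u2i + loss_i2u + w_lm * loss_lm
```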
arXiv Detail & Related papers (2024-05-03T18:51:19Z)
- Simple and Scalable Strategies to Continually Pre-train Large Language Models [20.643648785602462]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available.
We show that a simple and scalable combination of learning rate re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch.
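Because the summary names the three ingredients explicitly, a minimal sketch of one way to realize them follows; the linear warmup, cosine decay, and every constant are assumptions rather than the paper's settings.

```python
# Hypothetical re-warming/re-decaying LR schedule plus replay sampling for
# continuing pre-training on new data; shapes and constants are assumptions.
import math
import random

def continual_lr(step, warmup_steps=1_000, total_steps=100_000,
                 max_lr=3e-4, min_lr=3e-5):
    """Linearly re-warm to max_lr, then cosine re-decay toward min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def sample_example(new_corpus, old_corpus, replay_fraction=0.05):
    """Replay: draw a small fraction of training examples from the old corpus."""
    corpus = old_corpus if random.random() < replay_fraction else new_corpus
    return random.choice(corpus)
```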
arXiv Detail & Related papers (2024-03-13T17:58:57Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
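As a structural illustration only, the loop described above could be wired together as below; the five stages are passed in as callables because their internals belong to the paper, and none of these names are the MLLM-DataEngine API.

```python
# Hypothetical skeleton of the closed loop: evaluate -> analyze weaknesses ->
# adaptively re-weight data types -> generate targeted data -> retrain.
def data_engine_loop(model, eval_set, evaluate, analyze_weaknesses,
                     sample_ratios, generate_data, train, num_rounds=3):
    for _ in range(num_rounds):
        report = evaluate(model, eval_set)        # evaluation results
        weak_types = analyze_weaknesses(report)   # where the model fails
        ratios = sample_ratios(weak_types)        # Adaptive Bad-case Sampling role
        new_data = generate_data(ratios)          # e.g. GPT-4-written examples
        model = train(model, new_data)            # close the loop
    return model
```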
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Parallelizing Legendre Memory Unit Training [5.076419064097734]
A new recurrent neural network (RNN) named the Legendre Memory Unit (LMU) was proposed and shown to achieve state-of-the-art performance on several benchmark datasets.
Here we leverage the linear time-invariant (LTI) memory component of the LMU to construct a simplified variant that can be parallelized during training.
We show that this reformulation, which can be applied generally to any deep network whose recurrent components are linear, makes training up to 200 times faster.
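To make the parallelization argument concrete: a linear (LTI) recurrence h_t = A h_{t-1} + B x_t unrolls to h_t = sum over s <= t of A^(t-s) B x_s, so every state can be computed in one batched operation instead of a sequential loop over the data. The sketch below illustrates this unrolling; shapes and names are assumptions for exposition, and the quadratic kernel is for clarity, not the paper's efficient implementation.

```python
# Hypothetical illustration of why an LTI recurrence parallelizes:
# h_t = A h_{t-1} + B x_t, h_{-1} = 0  =>  h_t = sum_{s<=t} A^(t-s) B x_s.
import torch

def lti_states_parallel(A, B, x):
    """Compute all hidden states at once. A: (d, d), B: (d, n), x: (T, n)."""
    T, d = x.shape[0], A.shape[0]
    u = x @ B.T                                    # (T, d): B x_s for every step
    powers = [torch.eye(d)]                        # A^0 .. A^(T-1), data-independent
    for _ in range(T - 1):
        powers.append(powers[-1] @ A)
    P = torch.stack(powers)                        # (T, d, d)
    steps = torch.arange(T)
    lag = (steps[:, None] - steps[None, :]).clamp(min=0)   # t - s, floored at 0
    mask = (steps[:, None] >= steps[None, :]).float()      # keep terms with s <= t
    K = P[lag] * mask[:, :, None, None]            # K[t, s] = A^(t-s) if s <= t else 0
    return torch.einsum('tsde,se->td', K, u)       # every h_t in one batched op
```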
arXiv Detail & Related papers (2021-02-22T23:43:47Z)