Continual Pre-Training of Large Language Models: How to (re)warm your model?
- URL: http://arxiv.org/abs/2308.04014v2
- Date: Wed, 6 Sep 2023 23:13:07 GMT
- Title: Continual Pre-Training of Large Language Models: How to (re)warm your model?
- Authors: Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothée Lesort
- Abstract summary: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available.
We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens).
Our results show that while re-warming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch–even for a large downstream dataset.
- Score: 21.8468835868142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens,
only to restart the process over again once new data becomes available. A much
cheaper and more efficient solution would be to enable the continual
pre-training of these models, i.e. updating pre-trained models with new data
instead of re-training them from scratch. However, the distribution shift
induced by novel data typically results in degraded performance on past data.
Taking a step towards efficient continual pre-training, in this work, we
examine the effect of different warm-up strategies. Our hypothesis is that the
learning rate must be re-increased to improve compute efficiency when training
on a new dataset. We study the warmup phase of models pre-trained on the Pile
(upstream data, 300B tokens) as we continue to pre-train on SlimPajama
(downstream data, 297B tokens), following a linear warmup and cosine decay
schedule. We conduct all experiments on the Pythia 410M language model
architecture and evaluate performance through validation perplexity. We
experiment with different pre-training checkpoints, various maximum learning
rates, and various warmup lengths. Our results show that while rewarming models
first increases the loss on upstream and downstream data, in the longer run it
improves the downstream performance, outperforming models trained from
scratch–even for a large downstream dataset.
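A minimal sketch of the re-warming idea described above is given below: the learning-rate schedule (linear warmup followed by cosine decay) is simply restarted when continued pre-training begins on the downstream data, so the rate is first re-increased and then decayed again. The step counts and learning-rate values are illustrative assumptions, not the paper's reported hyperparameters.

```python
import math

def rewarmed_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    When continuing pre-training from a checkpoint, `step` counts from the
    start of the downstream run, so the learning rate is re-warmed to max_lr
    before decaying, instead of staying at the small final rate of the
    original run.
    """
    if step < warmup_steps:
        # Linear warmup from (near) zero up to the new maximum learning rate.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative values only: a short warmup relative to the full downstream run.
total_steps, warmup_steps = 100_000, 1_000
for step in (0, 500, 1_000, 50_000, total_steps - 1):
    print(step, rewarmed_lr(step, total_steps, warmup_steps, max_lr=3e-4, min_lr=3e-5))
```

In practice such a function would be wrapped in an optimizer scheduler (for example via `torch.optim.lr_scheduler.LambdaLR`); the experiments in the paper vary the starting checkpoint, the maximum learning rate, and the warmup length of this restarted schedule.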
Related papers
- Data Diet: Can Trimming PET/CT Datasets Enhance Lesion Segmentation? [70.38903555729081]
We describe our approach to compete in the autoPET3 datacentric track.
We find that in the autoPETIII dataset, a model that is trained on the entire dataset exhibits undesirable characteristics.
We counteract this by removing the easiest samples from the training dataset as measured by the model loss before retraining from scratch.
arXiv Detail & Related papers (2024-09-20T14:47:58Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Simple and Scalable Strategies to Continually Pre-train Large Language Models [20.643648785602462]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available.
We show that a simple and scalable combination of learning rate re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch.
arXiv Detail & Related papers (2024-03-13T17:58:57Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Revisiting the Updates of a Pre-trained Model for Few-shot Learning [11.871523410051527]
We compare the two popular updating methods, fine-tuning and linear probing.
We find that fine-tuning is better than linear probing as the number of samples increases.
arXiv Detail & Related papers (2022-05-13T08:47:06Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- LogME: Practical Assessment of Pre-trained Models for Transfer Learning [80.24059713295165]
The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
arXiv Detail & Related papers (2021-02-22T13:58:11Z)
- A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advances in NLP have shown that transfer learning helps achieve state-of-the-art results on new tasks by tuning pre-trained models instead of starting from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly better results than a BERT model on a small dataset, and that these simple models train in much less time than it takes to tune their pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.