Staged Training for Transformer Language Models
- URL: http://arxiv.org/abs/2203.06211v1
- Date: Fri, 11 Mar 2022 19:05:42 GMT
- Title: Staged Training for Transformer Language Models
- Authors: Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy
- Abstract summary: We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
- Score: 47.99321376123886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current standard approach to scaling transformer language models trains
each model size from a different random initialization. As an alternative, we
consider a staged training setup that begins with a small model and
incrementally increases the amount of compute used for training by applying a
"growth operator" to increase the model depth and width. By initializing each
stage with the output of the previous one, the training process effectively
re-uses the compute from prior stages and becomes more efficient. Our growth
operators each take as input the entire training state (including model
parameters, optimizer state, learning rate schedule, etc.) and output a new
training state from which training continues. We identify two important
properties of these growth operators, namely that they preserve both the loss
and the "training dynamics" after applying the operator. While the
loss-preserving property has been discussed previously, to the best of our
knowledge this work is the first to identify the importance of preserving the
training dynamics (the rate of decrease of the loss during training). To determine
the optimal schedule for stages, we use the scaling laws from (Kaplan et al.,
2020) to derive a precise schedule that gives the most compute savings by starting
a new stage when training efficiency starts decreasing. We empirically validate
our growth operators and staged training for autoregressive language models,
showing up to 22% compute savings compared to a strong baseline trained from
scratch. Our code is available at https://github.com/allenai/staged-training.
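To make the "loss-preserving" property of a growth operator concrete, below is a minimal sketch of Net2Net-style width growth on a two-layer MLP: hidden units are duplicated and their outgoing weights halved, so the grown network computes the same function (and hence the same loss) as the small one. This is an illustrative assumption, not the paper's exact operator; the grow_width helper is hypothetical, and the actual operators additionally transform the full training state (optimizer state, learning-rate schedule, etc.) and are designed to preserve training dynamics as well.

```python
# Sketch only: function-preserving width growth (Net2Net-style duplication),
# not the exact growth operator from the staged-training paper.
import torch
import torch.nn as nn

def grow_width(fc1: nn.Linear, fc2: nn.Linear):
    """Double the hidden width between fc1 and fc2 by duplicating hidden
    units and halving the duplicated outgoing weights of fc2."""
    d_in, d_hidden, d_out = fc1.in_features, fc1.out_features, fc2.out_features
    new_fc1 = nn.Linear(d_in, 2 * d_hidden)
    new_fc2 = nn.Linear(2 * d_hidden, d_out)
    with torch.no_grad():
        # Duplicate each hidden unit (rows of fc1's weight and bias).
        new_fc1.weight.copy_(torch.cat([fc1.weight, fc1.weight], dim=0))
        new_fc1.bias.copy_(torch.cat([fc1.bias, fc1.bias], dim=0))
        # Halve the outgoing weights (columns of fc2) so their sum is unchanged.
        new_fc2.weight.copy_(torch.cat([fc2.weight, fc2.weight], dim=1) / 2)
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Check that growing preserves the network's output, and therefore its loss.
torch.manual_seed(0)
fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 4)
g1, g2 = grow_width(fc1, fc2)
x = torch.randn(3, 8)
small_out = fc2(torch.relu(fc1(x)))
grown_out = g2(torch.relu(g1(x)))
print(torch.allclose(small_out, grown_out, atol=1e-5))  # True
```

The duplication trick works because ReLU is applied elementwise, so duplicated hidden activations are identical and halving their outgoing weights recovers the original pre-activation sums; preserving the training dynamics after growth is the harder requirement the paper focuses on.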
Related papers
- Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models [29.367678364485794]
We show how to design efficacious data distributions and learning rate schedules for continued pretraining of language models.
We show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set.
arXiv Detail & Related papers (2024-07-09T22:37:59Z)
- Landscape-Aware Growing: The Power of a Little LAG [49.897766925371485]
We study the question of how to select the best growing strategy from a given pool of growing strategies.
We present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)"
arXiv Detail & Related papers (2024-06-04T16:38:57Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Weight subcloning: direct initialization of transformers using larger pretrained ones [42.056148990349094]
We introduce a technique to transfer the knowledge of a pretrained model to smaller variants.
Weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
We achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.
arXiv Detail & Related papers (2023-12-14T19:08:56Z)
- Continual Pre-Training of Large Language Models: How to (re)warm your model? [21.8468835868142]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available.
We study the warmup phase of models pretrained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens).
Our results show that while re-warming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
arXiv Detail & Related papers (2023-08-08T03:18:18Z)
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)