Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training
- URL: http://arxiv.org/abs/2108.06084v1
- Date: Fri, 13 Aug 2021 06:32:53 GMT
- Title: Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training
- Authors: Conglong Li, Minjia Zhang, Yuxiong He
- Abstract summary: We present a study of a curriculum learning-based approach that helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
- Score: 18.640076155697415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have demonstrated great success in training high-capacity
autoregressive language models (GPT, GPT-2, GPT-3) on huge amounts of
unlabeled text for text generation. Despite showing great results, this approach
creates two training efficiency challenges. First, training on large corpora can
be extremely time-consuming, and how to present training samples to the model
to improve the token-wise convergence speed remains a challenging and open
question. Second, many of these large models have to be trained with hundreds
or even thousands of processors using data-parallelism with a very large batch
size. Despite its better compute efficiency, it has been observed that
large-batch training often runs into training instability issues or converges to
solutions with poor generalization performance. To overcome these two
challenges, we present a study of a curriculum learning-based approach, which
helps improve the pre-training convergence speed of autoregressive models.
More importantly, we find that curriculum learning, as a regularization method,
exerts a gradient variance reduction effect and makes it possible to train
autoregressive models with much larger batch sizes and learning rates without training
instability, further improving the training speed. Our evaluations demonstrate
that curriculum learning enables training GPT-2 models (with up to 1.5B
parameters) with 8x larger batch size and 4x larger learning rate, whereas the
baseline approach struggles with training divergence. To achieve the same
validation perplexity targets during pre-training, curriculum learning reduces
the required number of tokens and wall clock time by up to 59% and 54%,
respectively. To achieve the same or better zero-shot WikiText-103/LAMBADA
evaluation results at the end of pre-training, curriculum learning reduces the
required number of tokens and wall clock time by up to 13% and 61%,
respectively.
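As a rough illustration of the curriculum idea, the sketch below implements a simple pacing schedule in Python. The choice of sequence length as the difficulty metric, the linear pacing function, the warmup horizon, and all function names are assumptions made here for illustration; they are not taken from the paper's text.

```python
# Minimal sketch of a sequence-length curriculum for autoregressive pre-training.
# Assumptions (illustrative, not the authors' exact recipe): a linear pacing
# function and truncation of each batch to the current curriculum length.

def curriculum_seq_len(step, start_len=64, full_len=1024, warmup_steps=10000):
    """Linearly grow the allowed sequence length from start_len to full_len."""
    if step >= warmup_steps:
        return full_len
    frac = step / warmup_steps
    length = int(start_len + frac * (full_len - start_len))
    # Round down to a multiple of 8 for hardware efficiency (illustrative choice).
    return max(start_len, (length // 8) * 8)


def apply_curriculum(batch_tokens, step):
    """Truncate a [batch, full_len] token batch to the current curriculum length."""
    cur_len = curriculum_seq_len(step)
    return batch_tokens[:, :cur_len]
```

Feeding shorter, easier samples early in training is the intuition behind the gradient variance reduction effect described in the abstract, and it is what allows the much larger batch sizes and learning rates reported above.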
Related papers
- Irreducible Curriculum for Language Model Pretraining [46.895234111411426]
We propose irreducible curriculum as a curriculum learning algorithm for language model pretraining.
Our experiments on the RedPajama-1B dataset demonstrate a consistent improvement on validation perplexity across all 7 domains.
arXiv Detail & Related papers (2023-10-23T22:41:33Z)
- Early Weight Averaging meets High Learning Rates for LLM Pre-training [20.671831210738937]
We show that models trained with high learning rates observe higher gains due to checkpoint averaging.
Our training recipe outperforms conventional training and popular checkpoint averaging baselines.
arXiv Detail & Related papers (2023-06-05T20:51:44Z)
- LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- GAT: Guided Adversarial Training with Pareto-optimal Auxiliary Tasks [73.88590165742721]
We propose a novel adversarial training technique that exploits auxiliary tasks under a limited set of training data.
Our approach extends single-task models into multi-task models during the min-max optimization of adversarial training.
We demonstrate that guided multi-task learning is an actionable and promising avenue to push further the boundaries of model robustness.
arXiv Detail & Related papers (2023-02-06T16:23:24Z)
- Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
- Flipped Classroom: Effective Teaching for Time Series Forecasting [0.0]
Sequence-to-sequence models based on LSTMs and GRUs are among the most popular choices for forecasting time series data.
The two most common training strategies in this context are teacher forcing (TF) and free running (FR).
We propose several new curricula, and systematically evaluate their performance in two experimental sets.
arXiv Detail & Related papers (2022-10-17T11:53:25Z)
- Efficient NLP Model Finetuning via Multistage Data Filtering [11.058786955754004]
We set out to filter training examples in a streaming fashion, in tandem with training the target model.
Our key techniques are (1) automatically determining a training loss threshold for skipping backward training passes, and (2) running a meta predictor for further skipping forward training passes; a minimal sketch of the loss-threshold idea appears after this list.
Our method reduces the required training examples by up to 5.3x and training time by up to 6.8x, while only seeing minor accuracy degradation.
arXiv Detail & Related papers (2022-07-28T21:43:31Z)
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
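To make the loss-threshold filtering idea from the multistage data filtering entry above concrete, here is a minimal sketch. The rolling-percentile threshold, the buffer size, and the decision to skip only the backward pass are assumptions made for illustration; this is not the authors' exact procedure, and it omits their meta predictor for skipping forward passes.

```python
# Illustrative sketch: skip the backward pass for "easy" (low-loss) examples.
# The threshold is estimated from a rolling buffer of recent losses (assumption).
import torch

def train_step(model, optimizer, loss_fn, batch, recent_losses, skip_quantile=0.3):
    """Forward pass on every batch; backward pass only when the loss is high enough."""
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)

    # Maintain a rolling buffer of recent losses to estimate the skip threshold.
    recent_losses.append(loss.item())
    if len(recent_losses) > 1000:
        recent_losses.pop(0)
    threshold = sorted(recent_losses)[int(len(recent_losses) * skip_quantile)]

    if loss.item() >= threshold:   # hard enough: run the backward pass and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```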