Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training
- URL: http://arxiv.org/abs/2108.06084v1
- Date: Fri, 13 Aug 2021 06:32:53 GMT
- Title: Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training
- Authors: Conglong Li, Minjia Zhang, Yuxiong He
- Abstract summary: We present a study of a curriculum learning-based approach that helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
- Score: 18.640076155697415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have demonstrated great success in training high-capacity
autoregressive language models (GPT, GPT-2, GPT-3) on huge amounts of
unlabeled text for text generation. Despite showing great results, this approach
creates two training efficiency challenges. First, training on large corpora can
be extremely time-consuming, and how to present training samples to the model
to improve the token-wise convergence speed remains a challenging and open
question. Second, many of these large models have to be trained with hundreds
or even thousands of processors using data-parallelism with a very large batch
size. Despite its better compute efficiency, it has been observed that
large-batch training often runs into training instability issues or converges to
solutions with poor generalization performance. To overcome these two
challenges, we present a study of a curriculum learning-based approach, which
helps improve the pre-training convergence speed of autoregressive models.
More importantly, we find that curriculum learning, as a regularization method,
exerts a gradient variance reduction effect and makes it possible to train
autoregressive models with much larger batch sizes and learning rates without training
instability, further improving the training speed. Our evaluations demonstrate
that curriculum learning enables training GPT-2 models (with up to 1.5B
parameters) with 8x larger batch size and 4x larger learning rate, whereas the
baseline approach struggles with training divergence. To achieve the same
validation perplexity targets during pre-training, curriculum learning reduces
the required number of tokens and wall clock time by up to 59% and 54%,
respectively. To achieve the same or better zero-shot WikiText-103/LAMBADA
evaluation results at the end of pre-training, curriculum learning reduces the
required number of tokens and wall clock time by up to 13% and 61%,
respectively.
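As a rough illustration of the curriculum idea, the sketch below implements a simple pacing schedule in Python. The choice of sequence length as the difficulty metric, the linear pacing function, the warmup horizon, and all function names are assumptions made here for illustration; they are not taken from the paper's text.

```python
# Minimal sketch of a sequence-length curriculum for autoregressive pre-training.
# Assumptions (illustrative, not the authors' exact recipe): a linear pacing
# function and truncation of each batch to the current curriculum length.

def curriculum_seq_len(step, start_len=64, full_len=1024, warmup_steps=10000):
    """Linearly grow the allowed sequence length from start_len to full_len."""
    if step >= warmup_steps:
        return full_len
    frac = step / warmup_steps
    length = int(start_len + frac * (full_len - start_len))
    # Round down to a multiple of 8 for hardware efficiency (illustrative choice).
    return max(start_len, (length // 8) * 8)


def apply_curriculum(batch_tokens, step):
    """Truncate a [batch, full_len] token batch to the current curriculum length."""
    cur_len = curriculum_seq_len(step)
    return batch_tokens[:, :cur_len]
```

Feeding shorter, easier samples early in training is the intuition behind the gradient variance reduction effect described in the abstract, and it is what allows the much larger batch sizes and learning rates reported above.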
Related papers
- Irreducible Curriculum for Language Model Pretraining [46.895234111411426]
We propose irreducible curriculum as a curriculum learning algorithm for language model pretraining.
Our experiments on the RedPajama-1B dataset demonstrate a consistent improvement on validation perplexity across all 7 domains.
arXiv Detail & Related papers (2023-10-23T22:41:33Z)
- Early Weight Averaging meets High Learning Rates for LLM Pre-training [20.671831210738937]
We show that models trained with high learning rates observe higher gains due to checkpoint averaging.
Our training recipe outperforms conventional training and popular checkpoint averaging baselines.
arXiv Detail & Related papers (2023-06-05T20:51:44Z)
- LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- GAT: Guided Adversarial Training with Pareto-optimal Auxiliary Tasks [73.88590165742721]
We propose a novel adversarial training technique that exploits auxiliary tasks under a limited set of training data.
Our approach extends single-task models into multi-task models during the min-max optimization of adversarial training.
We demonstrate that guided multi-task learning is an actionable and promising avenue to push further the boundaries of model robustness.
arXiv Detail & Related papers (2023-02-06T16:23:24Z)
- Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
- Flipped Classroom: Effective Teaching for Time Series Forecasting [0.0]
Sequence-to-sequence models based on LSTMs and GRUs are among the most popular choices for forecasting time series data.
The two most common training strategies in this context are teacher forcing (TF) and free running (FR).
We propose several new curricula, and systematically evaluate their performance in two experimental sets.
arXiv Detail & Related papers (2022-10-17T11:53:25Z)
- Efficient NLP Model Finetuning via Multistage Data Filtering [11.058786955754004]
We set out to filter training examples in a streaming fashion, in tandem with training the target model.
Our key techniques are (1) automatically determining a training loss threshold for skipping backward training passes, and (2) running a meta predictor for further skipping forward training passes; a minimal sketch of the loss-threshold idea appears after this list.
Our method reduces the required training examples by up to 5.3x and training time by up to 6.8x, while only seeing minor accuracy degradation.
arXiv Detail & Related papers (2022-07-28T21:43:31Z)
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
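To make the loss-threshold filtering idea from the multistage data filtering entry above concrete, here is a minimal sketch. The rolling-percentile threshold, the buffer size, and the decision to skip only the backward pass are assumptions made for illustration; this is not the authors' exact procedure, and it omits their meta predictor for skipping forward passes.

```python
# Illustrative sketch: skip the backward pass for "easy" (low-loss) examples.
# The threshold is estimated from a rolling buffer of recent losses (assumption).
import torch

def train_step(model, optimizer, loss_fn, batch, recent_losses, skip_quantile=0.3):
    """Forward pass on every batch; backward pass only when the loss is high enough."""
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)

    # Maintain a rolling buffer of recent losses to estimate the skip threshold.
    recent_losses.append(loss.item())
    if len(recent_losses) > 1000:
        recent_losses.pop(0)
    threshold = sorted(recent_losses)[int(len(recent_losses) * skip_quantile)]

    if loss.item() >= threshold:   # hard enough: run the backward pass and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```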