Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training
- URL: http://arxiv.org/abs/2108.06084v1
- Date: Fri, 13 Aug 2021 06:32:53 GMT
- Title: Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training
- Authors: Conglong Li, Minjia Zhang, Yuxiong He
- Abstract summary: We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
- Score: 18.640076155697415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have demonstrated great success in training high-capacity
autoregressive language models (GPT, GPT-2, GPT-3) on huge unlabeled text
corpora for text generation. Despite the great results, this approach
raises two training efficiency challenges. First, training on large corpora can
be extremely time-consuming, and how to present training samples to the model
to improve the token-wise convergence speed remains a challenging and open
question. Second, many of these large models have to be trained with hundreds
or even thousands of processors using data-parallelism with a very large batch
size. Despite its better compute efficiency, it has been observed that
large-batch training often runs into training instability issues or converges to
solutions with poor generalization performance. To overcome these two
challenges, we present a study of a curriculum learning based approach, which
helps improve the pre-training convergence speed of autoregressive models.
More importantly, we find that curriculum learning, as a regularization method,
exerts a gradient variance reduction effect and enables training autoregressive
models with much larger batch sizes and learning rates without training
instability, further improving the training speed. Our evaluations demonstrate
that curriculum learning enables training GPT-2 models (with up to 1.5B
parameters) with 8x larger batch size and 4x larger learning rate, whereas the
baseline approach struggles with training divergence. To achieve the same
validation perplexity targets during pre-training, curriculum learning reduces
the required number of tokens and wall clock time by up to 59% and 54%,
respectively. To achieve the same or better zero-shot WikiText-103/LAMBADA
evaluation results at the end of pre-training, curriculum learning reduces the
required number of tokens and wall clock time by up to 13% and 61%,
respectively.
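The abstract does not say how sample difficulty is measured or how quickly it is raised. As a minimal sketch, assuming difficulty is sequence length and that the allowed length grows linearly over a fixed number of steps (the function names, default lengths, and rounding multiple below are illustrative assumptions, not details taken from the abstract), a curriculum schedule could look like this:

```python
from typing import List, Sequence


def curriculum_seq_len(
    step: int,
    total_curriculum_steps: int,
    start_len: int = 64,    # assumed starting sequence length
    end_len: int = 1024,    # assumed full sequence length
    multiple: int = 8,      # assumed rounding granularity
) -> int:
    """Return the sequence length allowed at a given training step.

    Hypothetical linear pacing: the length grows from start_len to end_len
    over total_curriculum_steps, then stays at end_len.
    """
    if step >= total_curriculum_steps:
        return end_len
    frac = step / total_curriculum_steps
    raw = start_len + (end_len - start_len) * frac
    # Round down to a multiple so tensor shapes stay hardware-friendly.
    length = int(raw) // multiple * multiple
    return max(start_len, min(end_len, length))


def truncate_batch(batch: Sequence[Sequence[int]], seq_len: int) -> List[List[int]]:
    """Cut every token sequence in the batch to at most seq_len tokens."""
    return [list(seq[:seq_len]) for seq in batch]
```

In a training loop, one would call `curriculum_seq_len(step, ...)` once per step and pass the result to `truncate_batch` before the forward pass, so early updates see only short (assumed easier) sequences.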
Related papers
- Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better [24.03797089794804]
We propose a Late-to-Early Training (LET) paradigm that enables Large Language Models to learn later knowledge in earlier steps and earlier layers. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. Our method achieves up to 1.6$\times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training.
arXiv Detail & Related papers (2026-02-05T07:19:34Z)
- Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training [70.60554423630803]
We propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We scale our approach to models with 70B parameters and over 1T training tokens, achieving 10.66% accuracy gain over training from scratch.
arXiv Detail & Related papers (2025-10-09T09:45:45Z)
- Hybrid Dual-Batch and Cyclic Progressive Learning for Efficient Distributed Training [1.084959821967413]
Experimental results with ResNet-18 demonstrate that, compared to conventional training methods, our approach improves accuracy by 3.3%. By combining cyclic progressive learning with dual-batch learning, our hybrid approach improves both model generalization and training efficiency.
arXiv Detail & Related papers (2025-09-30T11:10:47Z)
- Pretraining Large Language Models with NVFP4 [53.235038214986865]
We introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates a two-dimensional quantization scheme for consistent representations across both the forward and backward passes. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline.
arXiv Detail & Related papers (2025-09-29T17:53:17Z)
- Reinforcement Mid-Training [16.826401071555704]
We propose a framework for efficient, adaptive, and unified reinforcement mid-training. We show that RMT achieves up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
arXiv Detail & Related papers (2025-09-29T07:21:24Z)
- Thinking Augmented Pre-training [88.04395622064708]
Thinking augmented Pre-Training is a universal methodology that augments text with automatically generated thinking trajectories. This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories.
arXiv Detail & Related papers (2025-09-24T14:45:13Z)
- FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models [28.351652568849286]
This paper investigates how the model's context length and the complexity of the training dataset influence the training process of R1-like models.
We propose FastCuRL, a curriculum reinforcement learning framework with the progressive context extension strategy.
arXiv Detail & Related papers (2025-03-21T16:35:31Z)
- Alchemist: Towards the Design of Efficient Online Continual Learning System [15.224901317189728]
We propose Alchemist, to the best of our knowledge, the first online continual learning system that efficiently reuses serving activations to increase training throughput.
Alchemist significantly increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens.
arXiv Detail & Related papers (2025-03-03T00:14:34Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Irreducible Curriculum for Language Model Pretraining [46.895234111411426]
We propose irreducible curriculum as a curriculum learning algorithm for language model pretraining.
Our experiments on the RedPajama-1B dataset demonstrate a consistent improvement on validation perplexity across all 7 domains.
arXiv Detail & Related papers (2023-10-23T22:41:33Z)
- Early Weight Averaging meets High Learning Rates for LLM Pre-training [20.671831210738937]
We show that models trained with high learning rates observe higher gains due to checkpoint averaging.
Our training recipe outperforms conventional training and popular checkpoint averaging baselines.
arXiv Detail & Related papers (2023-06-05T20:51:44Z)
- LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- GAT: Guided Adversarial Training with Pareto-optimal Auxiliary Tasks [73.88590165742721]
We propose a novel adversarial training technique that exploits auxiliary tasks under a limited set of training data.
Our approach extends single-task models into multi-task models during the min-max optimization of adversarial training.
We demonstrate that guided multi-task learning is an actionable and promising avenue to push further the boundaries of model robustness.
arXiv Detail & Related papers (2023-02-06T16:23:24Z)
- Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
- Flipped Classroom: Effective Teaching for Time Series Forecasting [0.0]
Sequence-to-sequence models based on LSTM and GRU are a popular choice for forecasting time series data.
The two most common training strategies within this context are teacher forcing (TF) and free running (FR).
We propose several new curricula, and systematically evaluate their performance in two experimental sets.
arXiv Detail & Related papers (2022-10-17T11:53:25Z)
- Efficient NLP Model Finetuning via Multistage Data Filtering [11.058786955754004]
We set out to filter training examples in a streaming fashion, in tandem with training the target model.
Our key techniques are (1) automatically determine a training loss threshold for skipping backward training passes; (2) run a meta predictor for further skipping forward training passes.
Our method reduces the required training examples by up to 5.3$\times$ and training time by up to 6.8$\times$, while only seeing minor accuracy degradation.
arXiv Detail & Related papers (2022-07-28T21:43:31Z)
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to the contrastive learning methods when only half of training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.