Early Weight Averaging meets High Learning Rates for LLM Pre-training
- URL: http://arxiv.org/abs/2306.03241v2
- Date: Mon, 11 Dec 2023 22:31:12 GMT
- Title: Early Weight Averaging meets High Learning Rates for LLM Pre-training
- Authors: Sunny Sanyal, Atula Neerkaje, Jean Kaddour, Abhishek Kumar and Sujay
Sanghavi
- Abstract summary: We show that models trained with high learning rates observe higher gains due to checkpoint averaging.
Our training recipe outperforms conventional training and popular checkpoint averaging baselines.
- Score: 20.671831210738937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training Large Language Models (LLMs) incurs significant cost; hence, any
strategy that accelerates model convergence is helpful. In this paper, we
investigate the ability of a simple idea, checkpoint averaging along the
trajectory of a training run, to improve both convergence and generalization
quite early on during training. Here we show that models trained with high
learning rates observe higher gains due to checkpoint averaging. Furthermore,
these gains are amplified when checkpoints are sampled with considerable
spacing in training steps. Our training recipe outperforms conventional
training and popular checkpoint averaging baselines such as exponential moving
average (EMA) and stochastic weight averaging (SWA). We evaluate our training
recipe by pre-training LLMs, where high learning rates are inherently preferred
due to extremely large batch sizes. Specifically, we pre-trained nanoGPT-2
models of varying sizes, small (125M), medium (335M), and large (770M), on the
OpenWebText dataset, comprising 9B tokens. Additionally, we present results
for publicly available Pythia LLMs, ranging from 1B to 12B, which were trained
on the PILE-deduped dataset containing 207B tokens.
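As a concrete illustration of the recipe, the sketch below maintains a small buffer of checkpoints sampled with wide spacing along the training trajectory and exposes their uniform average for evaluation. It is a minimal sketch assuming a PyTorch-style model; the class and argument names (TrajectoryAverager, num_ckpts, every_n_steps) and their default values are illustrative assumptions, not the authors' implementation, and the high-learning-rate schedule itself is left to the training loop.

    # Minimal sketch of uniform checkpoint averaging along the training
    # trajectory. Names, defaults, and the buffer-based design are
    # illustrative assumptions, not the paper's exact implementation.
    from collections import deque

    import torch


    def average_state_dicts(state_dicts):
        """Uniformly average floating-point tensors; copy the rest from the latest."""
        avg = dict(state_dicts[-1])
        for key, val in state_dicts[-1].items():
            if torch.is_floating_point(val):
                avg[key] = torch.stack([sd[key] for sd in state_dicts], dim=0).mean(dim=0)
        return avg


    class TrajectoryAverager:
        """Keep the latest `num_ckpts` checkpoints, sampled every `every_n_steps`
        optimizer steps, and expose their uniform average for evaluation."""

        def __init__(self, num_ckpts=5, every_n_steps=1000):
            self.buffer = deque(maxlen=num_ckpts)
            self.every_n_steps = every_n_steps

        def maybe_collect(self, step, model):
            # Wider spacing between sampled checkpoints amplifies the gains,
            # per the abstract, so checkpoints are taken only every N steps.
            if step > 0 and step % self.every_n_steps == 0:
                self.buffer.append(
                    {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
                )

        def averaged_state_dict(self):
            assert self.buffer, "no checkpoints collected yet"
            return average_state_dicts(list(self.buffer))

In a training loop, maybe_collect(step, model) would be called after each optimizer step, and the averaged weights loaded into a separate copy of the model for evaluation. Unlike EMA, which updates a running average at every step, only widely spaced snapshots enter the average, and the live training weights are never modified.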
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z) - SwiftLearn: A Data-Efficient Training Method of Deep Learning Models
using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach to accelerate training of deep learning models.
The training subset is selected based on an importance criterion measured over the entire dataset during the warm-up stage (a hedged sketch of this kind of importance-based selection appears after this list).
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
arXiv Detail & Related papers (2023-11-25T22:51:01Z) - D4: Improving LLM Pretraining via Document De-Duplication and
Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing pre-trained models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning-based approach that helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z) - Graph Sampling Based Deep Metric Learning for Generalizable Person
Re-Identification [114.56752624945142]
We argue that the most popular random sampling method, the well-known PK sampler, is neither informative nor efficient for deep metric learning.
We propose an efficient mini batch sampling method called Graph Sampling (GS) for large-scale metric learning.
arXiv Detail & Related papers (2021-04-04T06:44:15Z) - EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z) - Neural Semi-supervised Learning for Text Classification Under
Large-Scale Pretraining [51.19885385587916]
We conduct studies on semi-supervised learning in the task of text classification under the context of large-scale LM pretraining.
Our work marks an initial step in understanding the behavior of semi-supervised learning models under the context of large-scale pretraining.
arXiv Detail & Related papers (2020-11-17T13:39:05Z) - To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on
Resource Rich Tasks [25.05882459314221]
We show that as the number of training examples grows into the millions, the accuracy gap between fine-tuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%.
Our findings indicate that pre-trained models may reach a point of diminishing returns as the supervised data size increases significantly.
arXiv Detail & Related papers (2020-06-15T18:18:59Z)
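The SwiftLearn entry above describes selecting a training subset via an importance criterion measured over the full dataset during warm-up. The toy sketch below illustrates one such criterion under stated assumptions: per-example loss after warm-up as the importance score and a 10% keep ratio are illustrative choices, not SwiftLearn's actual criterion or interface.

    # Toy sketch of importance-based data subset selection of the kind the
    # SwiftLearn entry describes. The scoring rule (per-example loss) and the
    # keep fraction are assumptions for illustration only.
    import torch
    from torch.utils.data import DataLoader, Subset


    @torch.no_grad()
    def select_important_subset(model, dataset, loss_fn, keep_fraction=0.1, batch_size=256):
        """Score every example once after warm-up and keep the top `keep_fraction`."""
        model.eval()
        scores = []
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
        for inputs, targets in loader:
            logits = model(inputs)
            # `loss_fn` is assumed to return one loss value per example
            # (e.g. cross-entropy with reduction="none").
            scores.append(loss_fn(logits, targets))
        scores = torch.cat(scores)
        k = max(1, int(keep_fraction * len(dataset)))
        keep_indices = torch.topk(scores, k).indices.tolist()
        # Train on the retained subset for the remaining epochs.
        return Subset(dataset, keep_indices)

The returned Subset can then be wrapped in a fresh DataLoader for the post-warm-up phase of training.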
This list is automatically generated from the titles and abstracts of the papers in this site.