Efficient Training of Language Models to Fill in the Middle
- URL: http://arxiv.org/abs/2207.14255v1
- Date: Thu, 28 Jul 2022 17:40:47 GMT
- Title: Efficient Training of Language Models to Fill in the Middle
- Authors: Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen
- Abstract summary: We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset.
We use these ablations to prescribe strong default settings and best practices to train FIM models.
We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
- Score: 17.118891860985123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that autoregressive language models can learn to infill text after we
apply a straightforward transformation to the dataset, which simply moves a
span of text from the middle of a document to its end. While this data
augmentation has garnered much interest in recent years, we provide extensive
evidence that training models with a large fraction of data transformed in this
way does not harm the original left-to-right generative capability, as measured
by perplexity and sampling evaluations across a wide range of scales. Given the
usefulness, simplicity, and efficiency of training models to fill-in-the-middle
(FIM), we suggest that future autoregressive language models be trained with
FIM by default. To this end, we run a series of ablations on key
hyperparameters, such as the data transformation frequency, the structure of
the transformation, and the method of selecting the infill span. We use these
ablations to prescribe strong default settings and best practices to train FIM
models. We have released our best infilling model trained with best practices
in our API, and release our infilling benchmarks to aid future research.
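Concretely, the transformation cuts a training document into a (prefix, middle, suffix) triple and re-emits it as prefix + suffix + middle, so that an ordinary left-to-right model learns to produce the middle span conditioned on both its prefix and suffix. The Python sketch below illustrates this rearrangement; the sentinel strings, the default transformation rate, and the uniform choice of cut points are illustrative assumptions rather than the paper's prescribed settings, which its ablations are designed to determine.

```python
import random

# Hypothetical sentinel strings marking the rearranged pieces; the paper's
# actual sentinel tokens and FIM rate are not specified in the abstract.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(document: str, fim_rate: float = 0.5) -> str:
    """With probability `fim_rate`, move a random middle span to the end.

    The document is split into (prefix, middle, suffix) at two random cut
    points and re-emitted as prefix + suffix + middle, so a left-to-right
    model sees the middle span last and learns to infill it.
    """
    if len(document) < 2 or random.random() >= fim_rate:
        return document  # leave as an ordinary left-to-right example

    # Two distinct cut points chosen uniformly at random (an assumption;
    # span selection is one of the hyperparameters the paper ablates).
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]

    # Prefix-Suffix-Middle ordering: the middle span is moved to the end.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

# Example: force the transformation to see the rearranged output.
print(fim_transform("def add(a, b):\n    return a + b\n", fim_rate=1.0))
```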
Related papers
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Large Language Model (LLM) pretraining traditionally relies on autoregressive language modeling on randomly sampled data blocks from web-scale datasets.
We take inspiration from human learning techniques like spaced repetition to hypothesize that random data sampling for LLMs leads to high training costs and low-quality models that tend to forget data.
In order to effectively commit web-scale information to long-term memory, we propose the LFR (Learn, Focus, and Review) pedagogy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Extensive Evaluation of Transformer-based Architectures for Adverse Drug Events Extraction [6.78974856327994]
Adverse Event (ADE) extraction is one of the core tasks in digital pharmacovigilance.
We evaluate 19 Transformer-based models for ADE extraction on informal texts.
At the end of our analyses, we identify a list of take-home messages that can be derived from the experimental data.
arXiv Detail & Related papers (2023-06-08T15:25:24Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space (a naive parameter-averaging illustration appears after this list).
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Evaluation of HTR models without Ground Truth Material [2.4792948967354236]
The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward.
But the evaluation process becomes tricky as soon as we switch from development to application.
We show that lexicon-based evaluation can compete with evaluation against ground truth material.
arXiv Detail & Related papers (2022-01-17T01:26:09Z)
- Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target (MIST).
Our experiments on three generation benchmarks, including question generation, summarization, and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing existing models of about half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Few-shot learning through contextual data augmentation [74.20290390065475]
Machine translation models need to adapt to new data to maintain their performance over time.
We show that adaptation on the scale of one to five examples is possible.
Our model reports better accuracy scores than a reference system trained on an average of 313 parallel examples.
arXiv Detail & Related papers (2021-03-31T09:05:43Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- On the importance of pre-training data volume for compact language models [0.7691755449724638]
We study the impact of pre-training data volume on compact language models.
We observe that well-performing models are obtained with as little as 100 MB of text.
arXiv Detail & Related papers (2020-10-08T07:40:21Z)
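The Dataless Knowledge Fusion entry above describes merging fine-tuned models directly in parameter space. As a rough, hypothetical illustration of that idea (not the paper's actual merging rule), the sketch below fuses models by element-wise averaging of their identically named parameters, assuming all models share the same parameter names and shapes.

```python
from typing import Dict, List

def merge_in_parameter_space(
    models: List[Dict[str, List[float]]],
) -> Dict[str, List[float]]:
    """Naively merge models by averaging each parameter element-wise.

    A simplified, hypothetical stand-in for merging in parameter space;
    the paper's actual merging procedure may weight contributions differently.
    """
    merged: Dict[str, List[float]] = {}
    for name in models[0]:
        merged[name] = [
            sum(m[name][i] for m in models) / len(models)
            for i in range(len(models[0][name]))
        ]
    return merged

# Two toy "models" fine-tuned on different data, sharing one parameter tensor.
model_a = {"linear.weight": [0.2, 0.4, 0.6]}
model_b = {"linear.weight": [0.4, 0.0, 0.2]}
print(merge_in_parameter_space([model_a, model_b]))  # element-wise average, roughly [0.3, 0.2, 0.4]
```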