bert2BERT: Towards Reusable Pretrained Language Models
- URL: http://arxiv.org/abs/2110.07143v1
- Date: Thu, 14 Oct 2021 04:05:25 GMT
- Title: bert2BERT: Towards Reusable Pretrained Language Models
- Authors: Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu
Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, Qun Liu
- Abstract summary: We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% computational cost of pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half sizes.
- Score: 51.078081486422896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, researchers tend to pre-train ever-larger language models to
explore the upper limit of deep models. However, large language model
pre-training costs intensive computational resources and most of the models are
trained from scratch without reusing the existing pre-trained models, which is
wasteful. In this paper, we propose bert2BERT, which can effectively transfer
the knowledge of an existing smaller pre-trained model (e.g., BERT_BASE) to a
large model (e.g., BERT_LARGE) through parameter initialization and
significantly improve the pre-training efficiency of the large model.
Specifically, we extend the previous function-preserving on Transformer-based
language model, and further improve it by proposing advanced knowledge for
large model's initialization. In addition, a two-stage pre-training method is
proposed to further accelerate the training process. We did extensive
experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that
(1) our method can save a significant amount of training cost compared with
baselines including learning from scratch, StackBERT and MSLT; (2) our method
is generic and applicable to different types of pre-trained models. In
particular, bert2BERT saves about 45% and 47% computational cost of
pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half
sizes. The source code will be publicly available upon publication.
Related papers
- Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization [22.90653167145603]
We introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions.
As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts.
arXiv Detail & Related papers (2024-09-19T16:50:26Z) - Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models [29.367678364485794]
We show how to design efficacious data distributions and learning rate schedules for continued pretraining of language models.
We show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set.
arXiv Detail & Related papers (2024-07-09T22:37:59Z) - An Emulator for Fine-Tuning Large Language Models using Small Language
Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - Reusing Pretrained Models by Multi-linear Operators for Efficient
Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z) - "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge with 762M parameters, at a budget of just $$187$ and $$800$ each.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO)
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - Bridging Pre-trained Models and Downstream Tasks for Source Code
Understanding [13.65914588243695]
We propose an approach to bridge pre-trained models and code-related tasks.
We exploit semantic-preserving transformation to enrich downstream data diversity.
We introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models.
arXiv Detail & Related papers (2021-12-04T07:21:28Z) - Transfer training from smaller language model [6.982133308738434]
We find a method to save training time and resource cost by changing the small well-trained model to large model.
We test the target model on several data sets and find it is still comparable with the source model.
arXiv Detail & Related papers (2021-04-23T02:56:02Z) - EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 3545% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.