CoRe: An Efficient Coarse-refined Training Framework for BERT
- URL: http://arxiv.org/abs/2011.13633v2
- Date: Thu, 18 Feb 2021 03:39:00 GMT
- Title: CoRe: An Efficient Coarse-refined Training Framework for BERT
- Authors: Cheng Yang, Shengnan Wang, Yuechuan Li, Chao Yang, Ming Yan, Jingqiao
Zhang, Fangquan Lin
- Abstract summary: We propose a novel coarse-refined training framework named CoRe to speed up the training of BERT.
In the first phase, we construct a relaxed BERT model which has far fewer parameters and much lower model complexity than the original BERT.
In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model.
- Score: 17.977099111813644
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, BERT has made significant breakthroughs on many natural
language processing tasks and attracted great attention. Despite its accuracy
gains, the BERT model generally involves a huge number of parameters and needs
to be trained on massive datasets, so training such a model is computationally
very challenging and time-consuming. Hence, training efficiency becomes a
critical issue. In this paper, we propose a novel coarse-refined training
framework named CoRe to speed up the training of BERT. Specifically, we
decompose the training process of BERT into two phases. In the first phase, by
introducing a fast attention mechanism and decomposing the large parameter
matrices in the feed-forward network sub-layer, we construct a relaxed BERT
model which has far fewer parameters and much lower model complexity than the
original BERT, so
the relaxed model can be quickly trained. In the second phase, we transform the
trained relaxed BERT model into the original BERT and further retrain the
model. Thanks to the desirable initialization provided by the relaxed model,
the retraining phase requires far fewer training steps than training an
original BERT model from scratch with a random initialization. Experimental
results show that the proposed CoRe framework can greatly reduce the training
time without reducing the performance.
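Since the abstract gives only a high-level description, below is a minimal PyTorch sketch of the two-phase idea for the feed-forward sub-layer alone: in phase 1 the FFN weight is replaced by two low-rank factors, and in phase 2 the trained factors are multiplied back into a full-size weight that initializes the original BERT FFN before retraining. The rank, module names, and the exact expansion rule are illustrative assumptions, and the fast attention mechanism is not shown.

```python
import torch
import torch.nn as nn

class RelaxedFFN(nn.Module):
    """Phase-1 FFN sub-layer (assumed form): the d_model x d_ff weight is
    replaced by two low-rank factors, shrinking the parameter count roughly
    by a factor of d_ff / rank when rank << d_ff."""
    def __init__(self, d_model=768, d_ff=3072, rank=128):
        super().__init__()
        self.in_factor = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.out_factor = nn.Linear(rank, d_ff)                # rank -> d_ff
        self.proj = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return self.proj(self.act(self.out_factor(self.in_factor(x))))

def expand_to_full_ffn(relaxed: RelaxedFFN) -> nn.Sequential:
    """Phase-2 transform (assumed): multiply the trained low-rank factors into
    a full d_model x d_ff weight and use it to initialize the original BERT
    FFN, which is then retrained for comparatively few steps."""
    d_model = relaxed.in_factor.in_features
    d_ff = relaxed.out_factor.out_features
    full = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), relaxed.proj)
    with torch.no_grad():
        # (d_ff x rank) @ (rank x d_model) matches nn.Linear(d_model, d_ff).weight
        full[0].weight.copy_(relaxed.out_factor.weight @ relaxed.in_factor.weight)
        full[0].bias.copy_(relaxed.out_factor.bias)
    return full

relaxed = RelaxedFFN()                   # trained in phase 1
full_ffn = expand_to_full_ffn(relaxed)   # initializes phase-2 retraining
```

Under this assumed expansion, the composed affine map W2(W1 x) + b2 equals (W2 W1) x + b2, so the full-size layer reproduces the relaxed layer's function at initialization, which is consistent with the initialization benefit the abstract describes.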
Related papers
- A Multi-Level Framework for Accelerating Training Transformer Models [5.268960238774481]
Training large-scale deep learning models poses an unprecedented demand for computing power.
We propose a multi-level framework for training acceleration based on Coalescing, De-coalescing and Interpolation.
We prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model.
arXiv Detail & Related papers (2024-04-07T03:04:34Z)
- Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method can achieve performance exceeding or very close to the state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Speeding up Deep Model Training by Sharing Weights and Then Unsharing [23.35912133295125]
We propose a simple and efficient approach for training the BERT model.
Our approach exploits the special structure of BERT that contains a stack of repeated modules.
arXiv Detail & Related papers (2021-10-08T01:23:34Z)
- Fast Certified Robust Training via Better Initialization and Shorter Warmup [95.81628508228623]
We propose a new IBP initialization and principled regularizers during the warmup stage to stabilize certified bounds.
We find that batch normalization (BN) is a crucial architectural element to build best-performing networks for certified training.
arXiv Detail & Related papers (2021-03-31T17:58:58Z)
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35~45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation (see the illustrative sketch after this list).
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
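The MSLT summary above says that only the top few layers receive gradients while the lower layers run forward only. The snippet below is a minimal sketch of that freezing step alone, not the full multi-stage stacking schedule; the Hugging Face BertForMaskedLM attribute names and the top_k value are assumptions rather than the paper's code.

```python
import torch
from transformers import BertForMaskedLM  # assumed implementation choice

def freeze_lower_layers(model: BertForMaskedLM, top_k: int = 3) -> None:
    """Keep gradients only for the top-k encoder layers (and the MLM head);
    lower layers and embeddings run forward-only."""
    layers = model.bert.encoder.layer
    cutoff = len(layers) - top_k
    for idx, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = idx >= cutoff  # backward only through the top-k layers
    for p in model.bert.embeddings.parameters():
        p.requires_grad = False

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
freeze_lower_layers(model, top_k=3)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # optimize the unfrozen subset only
```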
This list is automatically generated from the titles and abstracts of the papers in this site.