CoRe: An Efficient Coarse-refined Training Framework for BERT
- URL: http://arxiv.org/abs/2011.13633v2
- Date: Thu, 18 Feb 2021 03:39:00 GMT
- Title: CoRe: An Efficient Coarse-refined Training Framework for BERT
- Authors: Cheng Yang, Shengnan Wang, Yuechuan Li, Chao Yang, Ming Yan, Jingqiao
Zhang, Fangquan Lin
- Abstract summary: We propose a novel coarse-refined training framework named CoRe to speed up the training of BERT.
In the first phase, we construct a relaxed BERT model which has far fewer parameters and much lower model complexity than the original BERT.
In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model.
- Score: 17.977099111813644
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, BERT has made significant breakthroughs on many natural
language processing tasks and attracted great attention. Despite its accuracy
gains, the BERT model generally involves a huge number of parameters and needs
to be trained on massive datasets, so training such a model is computationally
very challenging and time-consuming. Hence, training efficiency becomes a
critical issue. In this paper, we propose a novel coarse-refined training
framework named CoRe to speed up the training of BERT. Specifically, we
decompose the training process of BERT into two phases. In the first phase, by
introducing a fast attention mechanism and decomposing the large parameter
matrices in the feed-forward network sub-layer, we construct a relaxed BERT
model which has far fewer parameters and much lower model complexity than the
original BERT, so
the relaxed model can be quickly trained. In the second phase, we transform the
trained relaxed BERT model into the original BERT and further retrain the
model. Thanks to the desirable initialization provided by the relaxed model,
the retraining phase requires far fewer training steps than training an
original BERT model from scratch with a random initialization. Experimental
results show that the proposed CoRe framework can greatly reduce the training
time without reducing the performance.
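Since the abstract gives only a high-level description, below is a minimal PyTorch sketch of the two-phase idea for the feed-forward sub-layer alone: in phase 1 the FFN weight is replaced by two low-rank factors, and in phase 2 the trained factors are multiplied back into a full-size weight that initializes the original BERT FFN before retraining. The rank, module names, and the exact expansion rule are illustrative assumptions, and the fast attention mechanism is not shown.

```python
import torch
import torch.nn as nn

class RelaxedFFN(nn.Module):
    """Phase-1 FFN sub-layer (assumed form): the d_model x d_ff weight is
    replaced by two low-rank factors, shrinking the parameter count roughly
    by a factor of d_ff / rank when rank << d_ff."""
    def __init__(self, d_model=768, d_ff=3072, rank=128):
        super().__init__()
        self.in_factor = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.out_factor = nn.Linear(rank, d_ff)                # rank -> d_ff
        self.proj = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return self.proj(self.act(self.out_factor(self.in_factor(x))))

def expand_to_full_ffn(relaxed: RelaxedFFN) -> nn.Sequential:
    """Phase-2 transform (assumed): multiply the trained low-rank factors into
    a full d_model x d_ff weight and use it to initialize the original BERT
    FFN, which is then retrained for comparatively few steps."""
    d_model = relaxed.in_factor.in_features
    d_ff = relaxed.out_factor.out_features
    full = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), relaxed.proj)
    with torch.no_grad():
        # (d_ff x rank) @ (rank x d_model) matches nn.Linear(d_model, d_ff).weight
        full[0].weight.copy_(relaxed.out_factor.weight @ relaxed.in_factor.weight)
        full[0].bias.copy_(relaxed.out_factor.bias)
    return full

relaxed = RelaxedFFN()                   # trained in phase 1
full_ffn = expand_to_full_ffn(relaxed)   # initializes phase-2 retraining
```

Under this assumed expansion, the composed affine map W2(W1 x) + b2 equals (W2 W1) x + b2, so the full-size layer reproduces the relaxed layer's function at initialization, which is consistent with the initialization benefit the abstract describes.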
Related papers
- A Multi-Level Framework for Accelerating Training Transformer Models [5.268960238774481]
Training large-scale deep learning models poses an unprecedented demand for computing power.
We propose a multi-level framework for training acceleration based on Coalescing, De-coalescing and Interpolation.
We prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model.
arXiv Detail & Related papers (2024-04-07T03:04:34Z)
- Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method can achieve performance exceeding or very close to the state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Speeding up Deep Model Training by Sharing Weights and Then Unsharing [23.35912133295125]
We propose a simple and efficient approach for training the BERT model.
Our approach exploits the special structure of BERT that contains a stack of repeated modules.
arXiv Detail & Related papers (2021-10-08T01:23:34Z)
- Fast Certified Robust Training via Better Initialization and Shorter Warmup [95.81628508228623]
We propose a new IBP initialization and principled regularizers during the warmup stage to stabilize certified bounds.
We find that batch normalization (BN) is a crucial architectural element to build best-performing networks for certified training.
arXiv Detail & Related papers (2021-03-31T17:58:58Z)
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35~45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation (see the illustrative sketch after this list).
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
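The MSLT summary above says that only the top few layers receive gradients while the lower layers run forward only. The snippet below is a minimal sketch of that freezing step alone, not the full multi-stage stacking schedule; the Hugging Face BertForMaskedLM attribute names and the top_k value are assumptions rather than the paper's code.

```python
import torch
from transformers import BertForMaskedLM  # assumed implementation choice

def freeze_lower_layers(model: BertForMaskedLM, top_k: int = 3) -> None:
    """Keep gradients only for the top-k encoder layers (and the MLM head);
    lower layers and embeddings run forward-only."""
    layers = model.bert.encoder.layer
    cutoff = len(layers) - top_k
    for idx, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = idx >= cutoff  # backward only through the top-k layers
    for p in model.bert.embeddings.parameters():
        p.requires_grad = False

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
freeze_lower_layers(model, top_k=3)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # optimize the unfrozen subset only
```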
This list is automatically generated from the titles and abstracts of the papers in this site.