RomeBERT: Robust Training of Multi-Exit BERT
- URL: http://arxiv.org/abs/2101.09755v1
- Date: Sun, 24 Jan 2021 17:03:57 GMT
- Title: RomeBERT: Robust Training of Multi-Exit BERT
- Authors: Shijie Geng, Peng Gao, Zuohui Fu, Yongfeng Zhang
- Abstract summary: BERT has achieved superior performance on Natural Language Understanding (NLU) tasks.
For acceleration, Dynamic Early Exiting for BERT (DeeBERT) was recently proposed.
In this paper, we leverage gradient-regularized self-distillation for RObust training of Multi-Exit BERT (RomeBERT).
- Score: 32.127811423380194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT has achieved superior performance on Natural Language Understanding
(NLU) tasks. However, BERT has a large number of parameters and demands
considerable resources to deploy. For acceleration, Dynamic Early Exiting for BERT
(DeeBERT) was recently proposed; it incorporates multiple exits and adopts a
dynamic early-exit mechanism to ensure efficient inference. While this yields an
efficiency-performance tradeoff, the early exits in multi-exit BERT perform
significantly worse than the late exits. In this paper, we leverage
gradient-regularized self-distillation for RObust training of Multi-Exit BERT
(RomeBERT), which effectively solves the performance imbalance between early and
late exits. Moreover, RomeBERT adopts a one-stage joint training strategy for the
multiple exits and the BERT backbone, whereas DeeBERT requires two training stages
and therefore more training time.
Extensive experiments on GLUE datasets are performed to demonstrate the
superiority of our approach. Our code is available at
https://github.com/romebert/RomeBERT.
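The abstract describes RomeBERT's ingredients only at a high level: exit classifiers attached to every layer of the BERT backbone, trained jointly in a single stage with self-distillation from the final exit to the earlier ones. The snippet below is a minimal sketch of that general setup, assuming a HuggingFace-style encoder; the class and function names (MultiExitBERT, self_distillation_loss) and the hyperparameter values are illustrative assumptions, and the gradient-regularization term that gives RomeBERT its name is omitted because the abstract does not specify it.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiExitBERT(nn.Module):
    """One classification head (an 'exit') after every transformer layer."""

    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder  # any BERT-like encoder that can return all hidden states
        self.exit_heads = nn.ModuleList(
            nn.Linear(encoder.config.hidden_size, num_labels)
            for _ in range(encoder.config.num_hidden_layers)
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids, attention_mask=attention_mask,
                           output_hidden_states=True)
        # hidden_states[0] is the embedding output; [1:] are the per-layer outputs
        return [head(h[:, 0])  # classify from the [CLS] position at each layer
                for head, h in zip(self.exit_heads, out.hidden_states[1:])]

def self_distillation_loss(all_logits, labels, temperature=2.0, alpha=0.5):
    """Cross-entropy on every exit plus soft-label distillation from the final exit."""
    teacher = all_logits[-1].detach()
    loss = F.cross_entropy(all_logits[-1], labels)
    for student in all_logits[:-1]:
        ce = F.cross_entropy(student, labels)
        kd = F.kl_div(F.log_softmax(student / temperature, dim=-1),
                      F.softmax(teacher / temperature, dim=-1),
                      reduction="batchmean") * temperature ** 2
        loss = loss + alpha * ce + (1.0 - alpha) * kd
    return loss
```

During training, this combined loss pushes every exit toward both the ground-truth labels and the softened predictions of the deepest exit, which is the usual way to narrow the gap between early and late exits.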
Related papers
- SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for
Accelerating BERT Inference [18.456002674399244]
We propose SmartBERT, a novel mechanism that combines dynamic early exiting with layer skipping for BERT inference.
SmartBERT can adaptively skip some layers and adaptively choose whether to exit.
We conduct experiments on eight classification datasets of the GLUE benchmark.
arXiv Detail & Related papers (2023-03-16T12:44:16Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT that aims to eliminate the performance bottlenecks of binarization.
Our method yields impressive savings of 56.3x in FLOPs and 31.2x in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- BERTGEN: Multi-task Generation through BERT [30.905286823599976]
We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing multimodal and multilingual pretrained models.
With a comprehensive set of evaluations, we show that BERTGEN outperforms many strong baselines across the tasks explored.
We also show BERTGEN's ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts.
arXiv Detail & Related papers (2021-06-07T10:17:45Z)
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)
- CoRe: An Efficient Coarse-refined Training Framework for BERT [17.977099111813644]
We propose a novel coarse-refined training framework named CoRe to speed up the training of BERT.
In the first phase, we construct a relaxed BERT model that has far fewer parameters and much lower model complexity than the original BERT.
In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model.
arXiv Detail & Related papers (2020-11-27T09:49:37Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT can save up to 40% inference time with minimal degradation in model quality; a sketch of this style of confidence-based exit rule appears after this list.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT)
It can flexibly adjust the size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
- TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval [11.923682816611716]
We present the TwinBERT model for effective and efficient retrieval.
It uses twin-structured BERT-like encoders to represent the query and the document respectively.
It allows document embeddings to be pre-computed offline and cached in memory.
arXiv Detail & Related papers (2020-02-14T22:44:36Z)
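The dynamic early-exit mechanism that RomeBERT inherits from DeeBERT is, at inference time, a confidence check at each exit, commonly based on the entropy of the exit's prediction: computation stops at the first sufficiently confident exit. Below is a minimal sketch of that rule, reusing the hypothetical MultiExitBERT module from the earlier sketch; the threshold value is an illustrative assumption, not taken from either paper's released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_predict(model, input_ids, attention_mask, threshold=0.1):
    """Return predictions from the first exit whose entropy falls below `threshold`."""
    out = model.encoder(input_ids, attention_mask=attention_mask,
                        output_hidden_states=True)
    for layer_idx, (head, h) in enumerate(
            zip(model.exit_heads, out.hidden_states[1:]), start=1):
        logits = head(h[:, 0])  # classify from the [CLS] position
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        # stop at the first exit that is confident enough (low entropy);
        # with a batch, every example must be confident before exiting here
        if entropy.max().item() < threshold:
            return logits.argmax(dim=-1), layer_idx
    return logits.argmax(dim=-1), layer_idx  # fall back to the deepest exit
```

Note that, for clarity, this sketch runs the whole encoder once and then scans the exits, so it does not itself save compute; a real implementation interleaves each transformer layer with its exit check so that the remaining layers are genuinely skipped, and the threshold is tuned per task to trade speed against accuracy.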