RomeBERT: Robust Training of Multi-Exit BERT
- URL: http://arxiv.org/abs/2101.09755v1
- Date: Sun, 24 Jan 2021 17:03:57 GMT
- Title: RomeBERT: Robust Training of Multi-Exit BERT
- Authors: Shijie Geng, Peng Gao, Zuohui Fu, Yongfeng Zhang
- Abstract summary: BERT has achieved superior performance on Natural Language Understanding (NLU) tasks.
For acceleration, Dynamic Early Exiting for BERT (DeeBERT) was recently proposed.
In this paper, we leverage gradient-regularized self-distillation for RObust training of Multi-Exit BERT (RomeBERT).
- Score: 32.127811423380194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT has achieved superior performance on Natural Language Understanding
(NLU) tasks. However, BERT has a large number of parameters and demands
considerable resources to deploy. For acceleration, Dynamic Early Exiting for BERT
(DeeBERT) was recently proposed; it incorporates multiple exits and adopts a
dynamic early-exit mechanism to ensure efficient inference. While this yields an
efficiency-performance tradeoff, the early exits in multi-exit BERT perform
significantly worse than the late exits. In this paper, we leverage
gradient-regularized self-distillation for RObust training of Multi-Exit BERT
(RomeBERT), which effectively solves the performance imbalance between early and
late exits. Moreover, RomeBERT adopts a one-stage joint training strategy for the
multiple exits and the BERT backbone, whereas DeeBERT requires two training stages
and therefore more training time.
Extensive experiments on GLUE datasets are performed to demonstrate the
superiority of our approach. Our code is available at
https://github.com/romebert/RomeBERT.
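The abstract describes RomeBERT's ingredients only at a high level: exit classifiers attached to every layer of the BERT backbone, trained jointly in a single stage with self-distillation from the final exit to the earlier ones. The snippet below is a minimal sketch of that general setup, assuming a HuggingFace-style encoder; the class and function names (MultiExitBERT, self_distillation_loss) and the hyperparameter values are illustrative assumptions, and the gradient-regularization term that gives RomeBERT its name is omitted because the abstract does not specify it.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiExitBERT(nn.Module):
    """One classification head (an 'exit') after every transformer layer."""

    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder  # any BERT-like encoder that can return all hidden states
        self.exit_heads = nn.ModuleList(
            nn.Linear(encoder.config.hidden_size, num_labels)
            for _ in range(encoder.config.num_hidden_layers)
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids, attention_mask=attention_mask,
                           output_hidden_states=True)
        # hidden_states[0] is the embedding output; [1:] are the per-layer outputs
        return [head(h[:, 0])  # classify from the [CLS] position at each layer
                for head, h in zip(self.exit_heads, out.hidden_states[1:])]

def self_distillation_loss(all_logits, labels, temperature=2.0, alpha=0.5):
    """Cross-entropy on every exit plus soft-label distillation from the final exit."""
    teacher = all_logits[-1].detach()
    loss = F.cross_entropy(all_logits[-1], labels)
    for student in all_logits[:-1]:
        ce = F.cross_entropy(student, labels)
        kd = F.kl_div(F.log_softmax(student / temperature, dim=-1),
                      F.softmax(teacher / temperature, dim=-1),
                      reduction="batchmean") * temperature ** 2
        loss = loss + alpha * ce + (1.0 - alpha) * kd
    return loss
```

During training, this combined loss pushes every exit toward both the ground-truth labels and the softened predictions of the deepest exit, which is the usual way to narrow the gap between early and late exits.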
Related papers
- SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for
Accelerating BERT Inference [18.456002674399244]
We propose SmartBERT, a novel mechanism that combines dynamic early exiting with layer skipping for BERT inference.
SmartBERT can adaptively skip some layers and adaptively choose whether to exit.
We conduct experiments on eight classification datasets of the GLUE benchmark.
arXiv Detail & Related papers (2023-03-16T12:44:16Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT that aims to eliminate the performance bottlenecks of binarization.
Our method yields impressive savings of 56.3x in FLOPs and 31.2x in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- BERTGEN: Multi-task Generation through BERT [30.905286823599976]
We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing multimodal and multilingual pretrained models.
With a comprehensive set of evaluations, we show that BERTGEN outperforms many strong baselines across the tasks explored.
We also show BERTGEN's ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts.
arXiv Detail & Related papers (2021-06-07T10:17:45Z)
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)
- CoRe: An Efficient Coarse-refined Training Framework for BERT [17.977099111813644]
We propose a novel coarse-refined training framework named CoRe to speed up the training of BERT.
In the first phase, we construct a relaxed BERT model that has far fewer parameters and much lower model complexity than the original BERT.
In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model.
arXiv Detail & Related papers (2020-11-27T09:49:37Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT can save up to 40% inference time with minimal degradation in model quality; a sketch of this style of confidence-based exit rule appears after this list.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT)
It can flexibly adjust the size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
- TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval [11.923682816611716]
We present the TwinBERT model for effective and efficient retrieval.
It uses twin-structured BERT-like encoders to represent the query and the document respectively.
It allows document embeddings to be pre-computed offline and cached in memory.
arXiv Detail & Related papers (2020-02-14T22:44:36Z)
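The dynamic early-exit mechanism that RomeBERT inherits from DeeBERT is, at inference time, a confidence check at each exit, commonly based on the entropy of the exit's prediction: computation stops at the first sufficiently confident exit. Below is a minimal sketch of that rule, reusing the hypothetical MultiExitBERT module from the earlier sketch; the threshold value is an illustrative assumption, not taken from either paper's released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_predict(model, input_ids, attention_mask, threshold=0.1):
    """Return predictions from the first exit whose entropy falls below `threshold`."""
    out = model.encoder(input_ids, attention_mask=attention_mask,
                        output_hidden_states=True)
    for layer_idx, (head, h) in enumerate(
            zip(model.exit_heads, out.hidden_states[1:]), start=1):
        logits = head(h[:, 0])  # classify from the [CLS] position
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        # stop at the first exit that is confident enough (low entropy);
        # with a batch, every example must be confident before exiting here
        if entropy.max().item() < threshold:
            return logits.argmax(dim=-1), layer_idx
    return logits.argmax(dim=-1), layer_idx  # fall back to the deepest exit
```

Note that, for clarity, this sketch runs the whole encoder once and then scans the exits, so it does not itself save compute; a real implementation interleaves each transformer layer with its exit check so that the remaining layers are genuinely skipped, and the threshold is tuned per task to trade speed against accuracy.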