DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
- URL: http://arxiv.org/abs/2004.12993v1
- Date: Mon, 27 Apr 2020 17:58:05 GMT
- Title: DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
- Authors: Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin
- Abstract summary: Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
- Score: 69.93692147242284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-trained language models such as BERT have brought significant
improvements to NLP applications. However, they are also notorious for being
slow in inference, which makes them difficult to deploy in real-time
applications. We propose a simple but effective method, DeeBERT, to accelerate
BERT inference. Our approach allows samples to exit earlier without passing
through the entire model. Experiments show that DeeBERT is able to save up to
~40% inference time with minimal degradation in model quality. Further analyses
show different behaviors in the BERT transformer layers and also reveal their
redundancy. Our work provides new ideas to efficiently apply deep
transformer-based models to downstream tasks. Code is available at
https://github.com/castorini/DeeBERT.
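The exit rule described in the abstract attaches a classifier ("off-ramp") after each transformer layer and stops as soon as an intermediate prediction is confident enough, measured by the entropy of its class distribution. A minimal sketch of that control flow follows; the toy layers, the `off_ramps` name, and the fixed distributions are illustrative stand-ins, not the repository's actual API:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_inference(layers, off_ramps, hidden, threshold):
    """Run transformer layers in order; after each one, ask that layer's
    off-ramp classifier for a prediction and return it as soon as the
    prediction is confident enough (entropy below the threshold)."""
    probs = None
    for layer, ramp in zip(layers, off_ramps):
        hidden = layer(hidden)
        probs = ramp(hidden)         # class probabilities at this ramp
        if entropy(probs) < threshold:
            return probs             # confident enough: exit early
    return probs                     # fell through to the final layer

# Toy stand-ins: each "layer" transforms a number, and each "off-ramp"
# returns a fixed class distribution for illustration.
layers = [lambda h: h + 1, lambda h: h + 1, lambda h: h + 1]
ramps = [lambda h: [0.5, 0.5],       # uncertain: keep going
         lambda h: [0.99, 0.01],     # confident: exit here
         lambda h: [1.0, 0.0]]       # never reached in this example
result = early_exit_inference(layers, ramps, 0, threshold=0.1)
```

Lowering the threshold trades speed for quality: a stricter (smaller) threshold pushes more samples through more layers before exiting.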
Related papers
- CEEBERT: Cross-Domain Inference in Early Exit BERT [5.402030962296633]
CeeBERT learns optimal thresholds from domain-specific confidence observed at intermediate layers on the fly.
CeeBERT can speed up the BERT/ALBERT models by $2\times$ - $3.5\times$ with minimal drop in accuracy.
arXiv Detail & Related papers (2024-05-23T20:36:10Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for Accelerating BERT Inference [18.456002674399244]
We propose a novel dynamic early exiting combined with layer skipping for BERT inference named SmartBERT.
SmartBERT can adaptively skip some layers and adaptively choose whether to exit.
We conduct experiments on eight classification datasets of the GLUE benchmark.
arXiv Detail & Related papers (2023-03-16T12:44:16Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)
- RomeBERT: Robust Training of Multi-Exit BERT [32.127811423380194]
BERT has achieved superior performances on Natural Language Understanding (NLU) tasks.
For acceleration, Dynamic Early Exiting for BERT (DeeBERT) has been proposed recently.
In this paper, we leverage gradient regularized self-distillation for RObust training of Multi-Exit BERT (RomeBERT).
arXiv Detail & Related papers (2021-01-24T17:03:57Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval [11.923682816611716]
We present the TwinBERT model for effective and efficient retrieval.
It has twin-structured BERT-like encoders to represent query and document respectively.
It allows document embeddings to be pre-computed offline and cached in memory.
arXiv Detail & Related papers (2020-02-14T22:44:36Z)
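TwinBERT's efficiency, as summarized above, comes from encoding queries and documents with separate encoders, so document vectors can be computed offline and cached, leaving only the query to encode at serving time. A minimal sketch of that bi-encoder pattern, with a toy bag-of-characters encoder standing in for the BERT-like encoders (all names and the scoring choice here are illustrative assumptions, not TwinBERT's actual implementation):

```python
import math

def embed(text):
    """Stand-in encoder: a real system would run a BERT-like encoder.
    A normalized bag-of-characters vector keeps the sketch runnable."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Offline: pre-compute document embeddings once and cache them in memory.
documents = ["bert inference", "early exit", "token pruning"]
doc_cache = {doc: embed(doc) for doc in documents}

def retrieve(query, k=2):
    """Online: encode only the query, then score it against the cached
    document vectors by dot product and return the top-k documents."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, d)), doc)
              for doc, d in doc_cache.items()]
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]
```

Because the document side of the computation never runs at query time, serving cost is dominated by a single query encoding plus cheap vector comparisons.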
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.