PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector
Elimination
- URL: http://arxiv.org/abs/2001.08950v5
- Date: Tue, 8 Sep 2020 14:11:33 GMT
- Title: PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector
Elimination
- Authors: Saurabh Goyal, Anamitra R. Choudhury, Saurabh M. Raje, Venkatesan T.
Chakaravarthy, Yogish Sabharwal, Ashish Verma
- Abstract summary: We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model.
We demonstrate that our method attains up to 6.8x reduction in inference time with less than 1% loss in accuracy when applied over ALBERT.
- Score: 4.965114253725414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a novel method, called PoWER-BERT, for improving the inference
time of the popular BERT model, while maintaining the accuracy. It works by: a)
exploiting redundancy pertaining to word-vectors (intermediate encoder outputs)
and eliminating the redundant vectors. b) determining which word-vectors to
eliminate by developing a strategy for measuring their significance, based on
the self-attention mechanism. c) learning how many word-vectors to eliminate by
augmenting the BERT model and the loss function. Experiments on the standard
GLUE benchmark show that PoWER-BERT achieves up to 4.5x reduction in inference
time over BERT with <1% loss in accuracy. We show that PoWER-BERT offers a
significantly better trade-off between accuracy and inference time compared to
prior methods. We demonstrate that our method attains up to 6.8x reduction in
inference time with <1% loss in accuracy when applied over ALBERT, a highly
compressed version of BERT. The code for PoWER-BERT is publicly available at
https://github.com/IBM/PoWER-BERT.
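As a concrete illustration of steps (a) and (b) above, the following is a minimal NumPy sketch of scoring word-vectors by the total self-attention they receive and keeping only the top-k after an encoder layer. The function name, array shapes, and the fixed k are illustrative assumptions and not the authors' released Keras implementation; in PoWER-BERT the number of word-vectors retained at each layer is learned, per step (c).
```python
# Minimal sketch (illustrative only) of attention-based word-vector elimination.
# Assumptions: `attn` holds self-attention probabilities of one encoder layer,
# shape (num_heads, seq_len, seq_len); `hidden` holds the word-vectors,
# shape (seq_len, hidden_dim); `k` is the number of word-vectors to keep.

import numpy as np

def eliminate_word_vectors(hidden: np.ndarray, attn: np.ndarray, k: int):
    """Keep the k most significant word-vectors after one encoder layer.

    Significance of position j is the total attention it receives,
    summed over all heads and all query positions.
    """
    significance = attn.sum(axis=(0, 1))           # shape: (seq_len,)
    keep = np.sort(np.argsort(-significance)[:k])  # top-k, original order preserved
    return hidden[keep], keep

# Toy usage: 4 heads, 8 tokens, 16-dim hidden states, keep 5 word-vectors.
rng = np.random.default_rng(0)
attn = rng.random((4, 8, 8))
attn /= attn.sum(axis=-1, keepdims=True)           # rows sum to 1, like softmax output
hidden = rng.standard_normal((8, 16))
pruned, kept_positions = eliminate_word_vectors(hidden, attn, k=5)
print(pruned.shape, kept_positions)                # (5, 16) and the retained indices
```
Applied after successive encoder layers with progressively smaller k, this kind of selection shrinks the sequence as it moves through the network, which is where the inference-time savings come from.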
Related papers
- oBERTa: Improving Sparse Transfer Learning via improved initialization,
distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- BEBERT: Efficient and robust binary ensemble BERT [12.109371576500928]
Binarization can reduce the cost of pre-trained BERT models but comes with a severe accuracy drop compared with their full-precision counterparts.
We propose an efficient and robust binary ensemble BERT (BEBERT) to bridge the accuracy gap.
arXiv Detail & Related papers (2022-10-28T08:15:26Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT designed to eliminate performance bottlenecks.
Our method yields impressive 56.3 times and 31.2 times savings in FLOPs and model size, respectively.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length [2.8770761243361593]
TinyBERT addresses computational efficiency by self-distilling BERT into a smaller transformer representation.
Dynamic-TinyBERT is trained only once, performing on par with BERT and achieving an accuracy-speedup trade-off superior to other efficient approaches.
arXiv Detail & Related papers (2021-11-18T11:58:19Z)
- Elbert: Fast Albert with Confidence-Window Based Early Exit [8.956309416589232]
Large pre-trained language models like BERT are not well-suited for resource-constrained or real-time applications.
We propose ELBERT, which significantly improves average inference speed over ALBERT through a confidence-window based early exit mechanism (a minimal sketch of this early-exit idea appears after this list).
arXiv Detail & Related papers (2021-07-01T02:02:39Z)
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
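Several entries above (ELBERT, DeeBERT) accelerate inference by exiting early rather than by removing word-vectors. The following is a minimal sketch of that confidence-based early-exit idea: attach a small classifier to every encoder layer and stop as soon as its prediction is confident enough. The function, the threshold, and the stand-in layers and heads are illustrative assumptions, not code from those papers.
```python
# Minimal sketch (illustrative only) of confidence-based early exit.
# Assumptions: `layers` is a list of per-layer encoder functions, `exit_heads`
# a matching list of classifiers returning class probabilities, and
# `threshold` the confidence required to stop early.

import numpy as np

def early_exit_forward(hidden, layers, exit_heads, threshold=0.9):
    """Run encoder layers until an exit head is confident, then stop."""
    for depth, (layer, head) in enumerate(zip(layers, exit_heads)):
        hidden = layer(hidden)
        probs = head(hidden)                  # shape: (num_classes,)
        if probs.max() >= threshold:          # confident enough: exit early
            return probs.argmax(), depth + 1  # prediction and layers actually used
    return probs.argmax(), len(layers)        # fell through: used the full model

# Toy usage with random stand-ins for real encoder layers and exit classifiers.
rng = np.random.default_rng(0)
layers = [lambda h: h + 0.01 * rng.standard_normal(h.shape) for _ in range(6)]

def make_head(hidden_dim=16, num_classes=3):
    w = rng.standard_normal((hidden_dim, num_classes))
    def head(h):
        logits = h.mean(axis=0) @ w           # pool tokens, then linear classifier
        e = np.exp(logits - logits.max())
        return e / e.sum()                    # softmax probabilities
    return head

exit_heads = [make_head() for _ in range(6)]
pred, layers_used = early_exit_forward(rng.standard_normal((8, 16)), layers, exit_heads)
print(pred, layers_used)
```
The threshold controls the accuracy-speed trade-off: a higher threshold exits later and preserves accuracy, while a lower one exits earlier and saves more computation.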