LadaBERT: Lightweight Adaptation of BERT through Hybrid Model
Compression
- URL: http://arxiv.org/abs/2004.04124v2
- Date: Wed, 21 Oct 2020 15:15:11 GMT
- Title: LadaBERT: Lightweight Adaptation of BERT through Hybrid Model
Compression
- Authors: Yihuan Mao, Yujing Wang, Chufan Wu, Chen Zhang, Yang Wang, Yaming
Yang, Quanlu Zhang, Yunhai Tong, Jing Bai
- Abstract summary: BERT is a cutting-edge language representation model pre-trained on a large corpus.
BERT is memory-intensive and leads to unsatisfactory latency for user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
- Score: 21.03685890385275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT is a cutting-edge language representation model pre-trained on
a large corpus, achieving superior performance on various natural language
understanding tasks. However, a major blocking issue in applying BERT to online
services is that it is memory-intensive and leads to unsatisfactory latency for
user requests, raising the necessity of model compression. Existing solutions
leverage the knowledge distillation framework to learn a smaller model that
imitates the behaviors of BERT. However, the knowledge distillation procedure
is itself expensive, as it requires sufficient training data to imitate the
teacher model. In this paper, we address this issue by proposing a hybrid
solution named LadaBERT (Lightweight adaptation of BERT through hybrid model
compression), which combines the advantages of different model compression
methods, including weight pruning, matrix factorization and knowledge
distillation. LadaBERT achieves state-of-the-art accuracy on various public
datasets while reducing the training overhead by an order of magnitude.
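A minimal sketch of how the three ingredients named in the abstract (weight pruning, matrix factorization and knowledge distillation) can be combined on a single weight matrix. This is not the authors' implementation; the layer sizes, rank, sparsity and temperature below are illustrative assumptions.

```python
# Hedged sketch: low-rank factorization + magnitude pruning + distillation loss
# applied to one linear layer. Hyperparameters are illustrative, not the paper's.
import torch
import torch.nn.functional as F

def factorize(weight: torch.Tensor, rank: int):
    """Low-rank factorization W ~= U @ V via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # fold singular values into U
    V_r = Vh[:rank, :]
    return U_r, V_r

def magnitude_prune(weight: torch.Tensor, sparsity: float):
    """Zero out the smallest-magnitude entries until `sparsity` fraction is zero."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-target KL divergence between student and teacher outputs."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Toy usage on a random "teacher" weight matrix.
W = torch.randn(768, 768)
U, V = factorize(W, rank=128)                              # matrix factorization
U, V = magnitude_prune(U, 0.3), magnitude_prune(V, 0.3)    # weight pruning
x = torch.randn(4, 768)
loss = distillation_loss(x @ (U @ V).T, x @ W.T)           # knowledge distillation
print(loss.item())
```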
Related papers
- Improving Knowledge Distillation for BERT Models: Loss Functions,
Mapping Methods, and Weight Tuning [1.1510009152620668]
This project investigates and applies knowledge distillation for BERT model compression.
We explore various techniques to improve knowledge distillation, including experimentation with loss functions, transformer layer mapping methods, and tuning the weights of attention and representation loss.
The goal of this work is to improve the efficiency and effectiveness of knowledge distillation, enabling the development of more efficient and accurate models for a range of natural language processing tasks.
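As a rough illustration of the ingredients described above (a soft-target loss plus a transformer layer mapping for representation loss), here is a minimal sketch. The uniform mapping, the loss weights, and the matching hidden sizes are assumptions for the example, not the paper's settings.

```python
# Hedged sketch of a composite distillation loss with a layer-mapping strategy.
# Assumes the teacher has at least as many layers as the student and that the
# hidden sizes already match (otherwise a projection would be needed).
import torch
import torch.nn.functional as F

def uniform_layer_map(num_student_layers: int, num_teacher_layers: int):
    """Map each student layer to an evenly spaced teacher layer."""
    step = num_teacher_layers // num_student_layers
    return [(i, (i + 1) * step - 1) for i in range(num_student_layers)]

def kd_loss(student_logits, teacher_logits,
            student_hidden, teacher_hidden,        # lists of (batch, seq, dim) tensors
            alpha=0.5, beta=0.5, temperature=2.0):
    t = temperature
    soft = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * t * t
    rep = 0.0
    for s_idx, t_idx in uniform_layer_map(len(student_hidden), len(teacher_hidden)):
        rep = rep + F.mse_loss(student_hidden[s_idx], teacher_hidden[t_idx])
    return alpha * soft + beta * rep

# Toy usage: 4 student layers distilling from a 12-layer teacher.
s_h = [torch.randn(2, 16, 768) for _ in range(4)]
t_h = [torch.randn(2, 16, 768) for _ in range(12)]
print(kd_loss(torch.randn(2, 3), torch.randn(2, 3), s_h, t_h).item())
```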
arXiv Detail & Related papers (2023-08-26T20:59:21Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization,
distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
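A hedged sketch of what importance-guided adaptation of one feed-forward network into experts can look like: the highest-importance intermediate neurons are shared by every expert and the rest are partitioned. Routing, importance scoring and training are omitted; the shapes and the share/split ratio are assumptions.

```python
# Hedged sketch: split an FFN's intermediate neurons into experts guided by
# per-neuron importance scores (top neurons shared, remainder partitioned).
import torch

def split_ffn_into_experts(w_in, w_out, importance, num_experts=4, num_shared=512):
    """w_in: (hidden, inter), w_out: (inter, hidden), importance: (inter,)."""
    order = torch.argsort(importance, descending=True)
    shared, rest = order[:num_shared], order[num_shared:]
    experts = []
    for chunk in torch.chunk(rest, num_experts):
        idx = torch.cat([shared, chunk])
        experts.append((w_in[:, idx].clone(), w_out[idx, :].clone()))
    return experts

# Toy usage with BERT-base-like sizes.
hidden, inter = 768, 3072
w_in, w_out = torch.randn(hidden, inter), torch.randn(inter, hidden)
importance = torch.rand(inter)
experts = split_ffn_into_experts(w_in, w_out, importance)
print([e[0].shape for e in experts])   # each expert keeps a subset of neurons
```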
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
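As a rough illustration of subgroup-wise quantization, here is a minimal sketch in which each row group of a weight matrix gets its own bit-width and scale. The automatic search over bit-widths and the joint pruning are omitted; the group size and bit assignments are arbitrary assumptions.

```python
# Hedged sketch: symmetric uniform quantization applied per row subgroup,
# with a different (assumed) bit-width for each subgroup.
import torch

def quantize_group(w: torch.Tensor, bits: int):
    """Symmetric uniform quantization of one subgroup to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def mixed_precision_quantize(weight: torch.Tensor, bit_assignment, group_size=64):
    """Quantize consecutive row groups with per-group bit-widths."""
    groups = weight.split(group_size, dim=0)
    return torch.cat([quantize_group(g, b) for g, b in zip(groups, bit_assignment)], dim=0)

W = torch.randn(256, 768)
bits = [8, 4, 4, 2]                       # one bit-width per 64-row subgroup
W_q = mixed_precision_quantize(W, bits)
print((W - W_q).abs().mean())             # mean quantization error
```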
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
The performance of BERT-based models is mainly driven by their enormous number of parameters, which hinders their application in resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
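A minimal sketch of weight binarization with a per-matrix scaling factor. The paper's training recipe (for example, initialization from a ternary model and distillation) is not reproduced here; this only shows the quantizer itself.

```python
# Hedged sketch: binarize a weight matrix to two values, +alpha and -alpha,
# where alpha is the mean absolute weight (a common scaling heuristic).
import torch

def binarize(weight: torch.Tensor):
    alpha = weight.abs().mean()                      # per-matrix scaling factor
    return torch.where(weight >= 0, alpha, -alpha)   # strictly two values

W = torch.randn(768, 768)
print(binarize(W).unique())                          # tensor([-alpha, +alpha])
```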
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
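A minimal sketch of threshold-based ternarization to {-alpha, 0, +alpha}. The threshold heuristic below is an assumption, and the distillation-aware training described above is omitted.

```python
# Hedged sketch: ternarize a weight matrix using a magnitude threshold; weights
# below the threshold become zero, the rest become +/- alpha.
import torch

def ternarize(weight: torch.Tensor, threshold_ratio: float = 0.7):
    delta = threshold_ratio * weight.abs().mean()     # ternary threshold (heuristic)
    mask = weight.abs() > delta
    alpha = weight[mask].abs().mean() if mask.any() else weight.new_tensor(0.0)
    return torch.sign(weight) * mask * alpha

W = torch.randn(768, 768)
print(ternarize(W).unique())                          # {-alpha, 0, +alpha}
```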
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust its size and latency by selecting an adaptive width and depth.
It consistently outperforms existing BERT compression methods.
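A hedged sketch of the width-adaptive idea: one set of weights that can be executed at several output widths. Importance-based neuron reordering, depth selection, and the distillation-based training are omitted; the class and sizes below are illustrative.

```python
# Hedged sketch: a linear layer that can run with only a fraction of its output
# neurons, so a single set of weights serves several widths at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Linear):
    """Linear layer whose effective output width can be reduced at run time."""
    def forward(self, x, width_mult: float = 1.0):
        out_features = int(self.out_features * width_mult)
        bias = self.bias[:out_features] if self.bias is not None else None
        return F.linear(x, self.weight[:out_features], bias)

layer = SlimmableLinear(768, 3072)
x = torch.randn(4, 768)
print(layer(x, width_mult=0.5).shape)    # torch.Size([4, 1536])
```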
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.