TernaryBERT: Distillation-aware Ultra-low Bit BERT
- URL: http://arxiv.org/abs/2009.12812v3
- Date: Sat, 10 Oct 2020 07:24:54 GMT
- Title: TernaryBERT: Distillation-aware Ultra-low Bit BERT
- Authors: Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu
- Abstract summary: We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
- Score: 53.06741585060951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based pre-training models like BERT have achieved remarkable
performance in many natural language processing tasks. However, these models are
expensive in both computation and memory, hindering their deployment to
resource-constrained devices. In this work, we propose TernaryBERT, which
ternarizes the weights in a fine-tuned BERT model. Specifically, we use both
approximation-based and loss-aware ternarization methods and empirically
investigate the ternarization granularity of different parts of BERT. Moreover,
to reduce the accuracy degradation caused by the lower capacity of low bits, we
leverage the knowledge distillation technique in the training process.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT
outperforms the other BERT quantization methods, and even achieves comparable
performance as the full-precision model while being 14.9x smaller.
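The abstract describes two ingredients: ternarizing the weights of a fine-tuned BERT and distilling from a full-precision teacher during training. Below is a minimal sketch of the approximation-based (TWN-style) ternarization step; the 0.7 threshold heuristic, the per-tensor granularity, and the straight-through-estimator update are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def ternarize_twn(w: torch.Tensor):
    """Approximation-based (TWN-style) ternarization -- illustrative sketch.

    Approximates w with alpha * b, where b has entries in {-1, 0, +1} and
    alpha > 0, using the common threshold heuristic delta = 0.7 * mean(|w|).
    """
    delta = 0.7 * w.abs().mean()                      # threshold heuristic (assumed)
    mask = (w.abs() > delta).float()                  # 1 where an entry stays non-zero
    b = torch.sign(w) * mask                          # ternary codes in {-1, 0, +1}
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # scaling factor
    return alpha, b

# Quantization-aware usage with a straight-through estimator: the forward
# pass sees alpha * b, while gradients flow into the latent full-precision w.
w = torch.randn(768, 768, requires_grad=True)
alpha, b = ternarize_twn(w.detach())
w_q = alpha * b + (w - w.detach())                    # value = alpha * b, grad -> w
```

To recover the accuracy lost to the low-bit weights, the ternarized student is trained to mimic the full-precision fine-tuned teacher. A generic transformer distillation loss for this purpose might combine representation, attention, and logit terms; the uniform layer mapping and unit loss weights below are assumptions for illustration, not the paper's exact objective.

```python
import torch.nn.functional as F

def distillation_loss(s_hidden, t_hidden, s_attn, t_attn, s_logits, t_logits,
                      temperature: float = 1.0):
    """Generic teacher-student loss for transformers -- illustrative sketch."""
    rep_loss = sum(F.mse_loss(s, t) for s, t in zip(s_hidden, t_hidden))   # hidden states
    attn_loss = sum(F.mse_loss(s, t) for s, t in zip(s_attn, t_attn))      # attention scores
    logit_loss = F.kl_div(                                                 # soft targets
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return rep_loss + attn_loss + logit_loss
```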
Related papers
- Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning [1.1510009152620668]
This project investigates and applies knowledge distillation for BERT model compression.
We explore various techniques to improve knowledge distillation, including experimentation with loss functions, transformer-layer mapping methods, and tuning of the weights on the attention and representation losses.
The goal of this work is to improve the efficiency and effectiveness of knowledge distillation, enabling the development of more efficient and accurate models for a range of natural language processing tasks.
arXiv Detail & Related papers (2023-08-26T20:59:21Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster, without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- BEBERT: Efficient and robust binary ensemble BERT [12.109371576500928]
Binarization of pre-trained BERT models can alleviate their computation and memory cost, but comes with a severe accuracy drop compared with their full-precision counterparts.
We propose an efficient and robust binary ensemble BERT (BEBERT) to bridge the accuracy gap.
arXiv Detail & Related papers (2022-10-28T08:15:26Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate fully binarized BERT designed to eliminate the performance bottlenecks of full binarization.
Our method yields impressive savings of 56.3 times in FLOPs and 31.2 times in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression [21.03685890385275]
BERT is a cutting-edge language representation model pre-trained on a large corpus.
However, it is memory-intensive and leads to unsatisfactory latency for user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
arXiv Detail & Related papers (2020-04-08T17:18:56Z)
- Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation [17.62309851473892]
We propose a distillation framework oriented toward sentence-representation approximation, which can distill a pre-trained BERT into a simple LSTM-based model.
Our model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task.
The experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods.
arXiv Detail & Related papers (2020-04-07T03:03:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.