RefBERT: Compressing BERT by Referencing to Pre-computed Representations
- URL: http://arxiv.org/abs/2106.08898v1
- Date: Fri, 11 Jun 2021 01:22:08 GMT
- Title: RefBERT: Compressing BERT by Referencing to Pre-computed Representations
- Authors: Xinyi Wang, Haiqin Yang, Liang Zhao, Yang Mo, Jianping Shen
- Abstract summary: RefBERT outperforms the vanilla TinyBERT by over 8.1% and achieves more than 94% of the performance of BERT$_{\rm BASE}$ on the GLUE benchmark.
RefBERT is 7.4x smaller and 9.5x faster at inference than BERT$_{\rm BASE}$.
- Score: 19.807272592342148
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently developed large pre-trained language models, e.g., BERT, have
achieved remarkable performance in many downstream natural language processing
applications. These pre-trained language models often contain hundreds of
millions of parameters and suffer from high computation and latency in
real-world applications. It is desirable to reduce the computation overhead of
the models for fast training and inference while keeping the model performance
in downstream applications. Several lines of work utilize knowledge
distillation to compress the teacher model to a smaller student model. However,
they usually discard the teacher's knowledge during inference. In contrast, in
this paper we propose RefBERT to leverage the knowledge learned by the
teacher, i.e., utilizing the teacher's pre-computed BERT representations on
reference samples while compressing BERT into a smaller student model. To
support our proposal, we provide theoretical justification for the loss
function and the usage of reference samples. Notably, the theoretical
result shows that including the pre-computed teacher's representations on the
reference samples indeed increases the mutual information in learning the
student model. Finally, we conduct an empirical evaluation and show that our
RefBERT outperforms the vanilla TinyBERT by over 8.1\% and achieves more than 94\% of
the performance of BERT$_{\rm BASE}$ on the GLUE benchmark. Meanwhile, RefBERT is
7.4x smaller and 9.5x faster at inference than BERT$_{\rm BASE}$.
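
The core idea above, keeping the teacher's knowledge available to the student at inference time via pre-computed representations of reference samples, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under assumed module names, dimensions, and layer counts; it is not the paper's implementation. A small student encoder cross-attends to a cached BERT$_{\rm BASE}$ representation of a reference sample.

```python
import torch
import torch.nn as nn

class ReferenceAugmentedStudent(nn.Module):
    """Small student encoder that cross-attends to a cached teacher representation.

    This is only a sketch of the general idea, not RefBERT's actual architecture;
    dimensions, layer counts, and module names are illustrative assumptions.
    """

    def __init__(self, student_dim: int = 312, teacher_dim: int = 768, num_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=student_dim, nhead=num_heads, batch_first=True
        )
        self.student_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Project the pre-computed BERT_BASE representation into the student's
        # smaller hidden space so the two can interact.
        self.ref_proj = nn.Linear(teacher_dim, student_dim)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=student_dim, num_heads=num_heads, batch_first=True
        )

    def forward(self, input_embeddings: torch.Tensor, cached_teacher_ref: torch.Tensor):
        # input_embeddings:   (batch, seq_len, student_dim) for the current input
        # cached_teacher_ref: (batch, ref_len, teacher_dim), computed once with the
        #                     teacher on the reference sample and stored offline
        hidden = self.student_encoder(input_embeddings)
        ref = self.ref_proj(cached_teacher_ref)
        fused, _ = self.cross_attn(query=hidden, key=ref, value=ref)
        # Residual combination: the student keeps its own encoding but is
        # enriched by the teacher's representation of the reference sample.
        return hidden + fused


# Toy usage with random tensors (batch of 2, input length 16, reference length 16).
student = ReferenceAugmentedStudent()
x = torch.randn(2, 16, 312)
ref = torch.randn(2, 16, 768)
out = student(x, ref)  # (2, 16, 312)
```

Because the reference representations are computed once offline, only the small student and the cached tensors are needed at inference time; the teacher itself never has to run.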
Related papers
- Larger models yield better results? Streamlined severity classification of ADHD-related concerns using BERT-based knowledge distillation [0.6793286055326242]
We create a lightweight yet powerful BERT-based model for natural language processing applications.
We apply the resulting model, LastBERT, to a real-world task classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data.
arXiv Detail & Related papers (2024-10-30T17:57:44Z)
- GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model [20.620589404103644]
This paper introduces GenDistiller, a novel knowledge distillation framework which generates the hidden representations of the pre-trained teacher model directly by a much smaller student network.
The proposed method takes the previous hidden layer as history and implements a layer-by-layer prediction of the teacher model autoregressively.
Experiments reveal the advantage of GenDistiller over the baseline distilling method without an autoregressive framework, with 33% fewer parameters, similar time consumption and better performance on most of the SUPERB tasks.
arXiv Detail & Related papers (2024-06-12T01:25:00Z)
- ReFT: Representation Finetuning for Language Models [74.51093640257892]
We develop a family of Representation Finetuning (ReFT) methods.
ReFTs operate on a frozen base model and learn task-specific interventions on hidden representations.
We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE.
arXiv Detail & Related papers (2024-04-04T17:00:37Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Sparse Distillation: Speeding Up Text Classification by Using Bigger Models [49.8019791766848]
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
arXiv Detail & Related papers (2021-10-16T10:04:14Z)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers [52.85472936277762]
We apply knowledge distillation to improve the recently proposed late-interaction ColBERT model.
We distill the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product.
We empirically show that our approach improves query latency and greatly reduces the onerous storage requirements of ColBERT (a sketch contrasting MaxSim and dot-product scoring appears after this list).
arXiv Detail & Related papers (2020-10-22T02:26:01Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
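
As mentioned in the Tightly-Coupled Teachers entry above, the contrast between ColBERT's MaxSim scoring and the distilled single-vector dot product is easy to sketch. The code below is an illustration with assumed shapes and names, not the paper's code.

```python
import torch

def maxsim_score(query_tok: torch.Tensor, doc_tok: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query token take the maximum
    similarity over all document tokens, then sum over query tokens."""
    # query_tok: (num_query_tokens, dim), doc_tok: (num_doc_tokens, dim)
    sim = query_tok @ doc_tok.T  # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()

def dot_product_score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    """The distilled target: one pooled vector per query and per document, so
    documents can be pre-encoded and searched with standard dense retrieval."""
    return query_vec @ doc_vec

# Toy example with random 128-dimensional token embeddings.
q_tok, d_tok = torch.randn(8, 128), torch.randn(200, 128)
print(maxsim_score(q_tok, d_tok))
print(dot_product_score(q_tok.mean(dim=0), d_tok.mean(dim=0)))
```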
This list is automatically generated from the titles and abstracts of the papers on this site.