RefBERT: Compressing BERT by Referencing to Pre-computed Representations
- URL: http://arxiv.org/abs/2106.08898v1
- Date: Fri, 11 Jun 2021 01:22:08 GMT
- Title: RefBERT: Compressing BERT by Referencing to Pre-computed Representations
- Authors: Xinyi Wang, Haiqin Yang, Liang Zhao, Yang Mo, Jianping Shen
- Abstract summary: RefBERT outperforms the vanilla TinyBERT by over 8.1% and achieves more than 94% of the performance of BERT$_{\rm BASE}$ on the GLUE benchmark.
RefBERT is 7.4x smaller and 9.5x faster at inference than BERT$_{\rm BASE}$.
- Score: 19.807272592342148
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently developed large pre-trained language models, e.g., BERT, have
achieved remarkable performance in many downstream natural language processing
applications. These pre-trained language models often contain hundreds of
millions of parameters and suffer from high computation and latency in
real-world applications. It is desirable to reduce the computation overhead of
the models for fast training and inference while keeping the model performance
in downstream applications. Several lines of work utilize knowledge
distillation to compress the teacher model to a smaller student model. However,
they usually discard the teacher's knowledge during inference. In contrast, in
this paper we propose RefBERT to leverage the knowledge learned by the
teacher, i.e., utilizing the teacher's pre-computed BERT representations on
reference samples while compressing BERT into a smaller student model. To
support our proposal, we provide theoretical justification for the loss
function and the usage of reference samples. Notably, the theoretical
result shows that including the pre-computed teacher's representations on the
reference samples indeed increases the mutual information in learning the
student model. Finally, we conduct an empirical evaluation and show that our
RefBERT outperforms the vanilla TinyBERT by over 8.1\% and achieves more than 94\% of
the performance of BERT$_{\rm BASE}$ on the GLUE benchmark. Meanwhile, RefBERT is
7.4x smaller and 9.5x faster at inference than BERT$_{\rm BASE}$.
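
The core idea above, keeping the teacher's knowledge available to the student at inference time via pre-computed representations of reference samples, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under assumed module names, dimensions, and layer counts; it is not the paper's implementation. A small student encoder cross-attends to a cached BERT$_{\rm BASE}$ representation of a reference sample.

```python
import torch
import torch.nn as nn

class ReferenceAugmentedStudent(nn.Module):
    """Small student encoder that cross-attends to a cached teacher representation.

    This is only a sketch of the general idea, not RefBERT's actual architecture;
    dimensions, layer counts, and module names are illustrative assumptions.
    """

    def __init__(self, student_dim: int = 312, teacher_dim: int = 768, num_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=student_dim, nhead=num_heads, batch_first=True
        )
        self.student_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Project the pre-computed BERT_BASE representation into the student's
        # smaller hidden space so the two can interact.
        self.ref_proj = nn.Linear(teacher_dim, student_dim)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=student_dim, num_heads=num_heads, batch_first=True
        )

    def forward(self, input_embeddings: torch.Tensor, cached_teacher_ref: torch.Tensor):
        # input_embeddings:   (batch, seq_len, student_dim) for the current input
        # cached_teacher_ref: (batch, ref_len, teacher_dim), computed once with the
        #                     teacher on the reference sample and stored offline
        hidden = self.student_encoder(input_embeddings)
        ref = self.ref_proj(cached_teacher_ref)
        fused, _ = self.cross_attn(query=hidden, key=ref, value=ref)
        # Residual combination: the student keeps its own encoding but is
        # enriched by the teacher's representation of the reference sample.
        return hidden + fused


# Toy usage with random tensors (batch of 2, input length 16, reference length 16).
student = ReferenceAugmentedStudent()
x = torch.randn(2, 16, 312)
ref = torch.randn(2, 16, 768)
out = student(x, ref)  # (2, 16, 312)
```

Because the reference representations are computed once offline, only the small student and the cached tensors are needed at inference time; the teacher itself never has to run.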
Related papers
- Larger models yield better results? Streamlined severity classification of ADHD-related concerns using BERT-based knowledge distillation [0.6793286055326242]
We create a lightweight yet powerful BERT-based model for natural language processing applications.
We apply the resulting model, LastBERT, to a real-world task classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data.
arXiv Detail & Related papers (2024-10-30T17:57:44Z)
- GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model [20.620589404103644]
This paper introduces GenDistiller, a novel knowledge distillation framework which generates the hidden representations of the pre-trained teacher model directly by a much smaller student network.
The proposed method takes the previous hidden layer as history and implements a layer-by-layer prediction of the teacher model autoregressively.
Experiments reveal the advantage of GenDistiller over the baseline distilling method without an autoregressive framework, with 33% fewer parameters, similar time consumption and better performance on most of the SUPERB tasks.
arXiv Detail & Related papers (2024-06-12T01:25:00Z)
- ReFT: Representation Finetuning for Language Models [74.51093640257892]
We develop a family of Representation Finetuning (ReFT) methods.
ReFTs operate on a frozen base model and learn task-specific interventions on hidden representations.
We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE.
arXiv Detail & Related papers (2024-04-04T17:00:37Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Sparse Distillation: Speeding Up Text Classification by Using Bigger Models [49.8019791766848]
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
arXiv Detail & Related papers (2021-10-16T10:04:14Z)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers [52.85472936277762]
We apply knowledge distillation to improve the recently proposed late-interaction ColBERT model.
We distill the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product.
We empirically show that our approach improves query latency and greatly reduces the onerous storage requirements of ColBERT (a sketch contrasting MaxSim and dot-product scoring appears after this list).
arXiv Detail & Related papers (2020-10-22T02:26:01Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
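
As mentioned in the Tightly-Coupled Teachers entry above, the contrast between ColBERT's MaxSim scoring and the distilled single-vector dot product is easy to sketch. The code below is an illustration with assumed shapes and names, not the paper's code.

```python
import torch

def maxsim_score(query_tok: torch.Tensor, doc_tok: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query token take the maximum
    similarity over all document tokens, then sum over query tokens."""
    # query_tok: (num_query_tokens, dim), doc_tok: (num_doc_tokens, dim)
    sim = query_tok @ doc_tok.T  # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()

def dot_product_score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    """The distilled target: one pooled vector per query and per document, so
    documents can be pre-encoded and searched with standard dense retrieval."""
    return query_vec @ doc_vec

# Toy example with random 128-dimensional token embeddings.
q_tok, d_tok = torch.randn(8, 128), torch.randn(200, 128)
print(maxsim_score(q_tok, d_tok))
print(dot_product_score(q_tok.mean(dim=0), d_tok.mean(dim=0)))
```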
This list is automatically generated from the titles and abstracts of the papers on this site.