TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for
Efficient Retrieval
- URL: http://arxiv.org/abs/2002.06275v1
- Date: Fri, 14 Feb 2020 22:44:36 GMT
- Title: TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for
Efficient Retrieval
- Authors: Wenhao Lu, Jian Jiao, Ruofei Zhang
- Abstract summary: We present the TwinBERT model for effective and efficient retrieval.
It has twin-structured BERT-like encoders to represent the query and document respectively.
It allows document embeddings to be pre-computed offline and cached in memory.
- Score: 11.923682816611716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models like BERT have achieved great success in a wide
variety of NLP tasks, but their superior performance comes with a high demand for
computational resources, which hinders their application in low-latency IR systems.
We present the TwinBERT model for effective and efficient retrieval, which has
twin-structured BERT-like encoders to represent the query and document respectively
and a crossing layer to combine the two embeddings and produce a similarity score.
Unlike BERT, where the two input sentences are concatenated and encoded together,
TwinBERT decouples them during encoding and produces the query and document
embeddings independently, which allows document embeddings to be pre-computed
offline and cached in memory. The computation left for run time is therefore only
the query encoding and the query-document crossing. This single change saves a
large amount of computation time and resources and thus significantly improves
serving efficiency. Moreover, a few well-designed network layers and training
strategies are proposed to further reduce computational cost while keeping the
performance as remarkable as that of the BERT model. Lastly, we develop two
versions of TwinBERT for retrieval and relevance tasks respectively, and both
achieve performance close to or on par with the BERT-Base model.
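The decoupled-encoder pattern described above can be pictured with a short sketch. This is a minimal illustration, not the paper's implementation: the tiny mean-pooling encoder and the use of plain cosine similarity in place of TwinBERT's crossing layer are assumptions, and names such as TinyEncoder and doc_cache are hypothetical.

```python
# Sketch of twin-encoder retrieval: documents are encoded once offline and
# cached, so only the query encoder and a cheap crossing step run online.
import torch
import torch.nn.functional as F

class TinyEncoder(torch.nn.Module):
    """Stand-in for a BERT-like encoder that maps token ids to one vector."""
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        pooled = self.emb(token_ids).mean(dim=1)   # average pooling over tokens
        return F.normalize(self.proj(pooled), dim=-1)

query_encoder, doc_encoder = TinyEncoder(), TinyEncoder()

# Offline: encode the corpus once and cache the embeddings in memory.
corpus = {"doc1": torch.randint(0, 30522, (1, 64)),
          "doc2": torch.randint(0, 30522, (1, 64))}
with torch.no_grad():
    doc_cache = {doc_id: doc_encoder(ids) for doc_id, ids in corpus.items()}

# Online: only the query is encoded; document embeddings come from the cache,
# and cosine similarity stands in for the crossing layer.
query_ids = torch.randint(0, 30522, (1, 16))
with torch.no_grad():
    q = query_encoder(query_ids)
    scores = {doc_id: F.cosine_similarity(q, d).item()
              for doc_id, d in doc_cache.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```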
The model was trained following the teacher-student framework and evaluated
with data from one of the major search engines. Experimental results showed
that the inference time was significantly reduced and, for the first time,
kept to around 20 ms on CPUs, while most of the performance gain of the
fine-tuned BERT-Base model was retained. Integrating the models into production
systems also demonstrated remarkable improvements on relevance metrics with
negligible impact on latency.
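The teacher-student training mentioned in the abstract can be sketched as a distillation objective that blends the teacher's soft relevance scores with hard labels. The MSE-plus-BCE form, the alpha weight, and the name distillation_loss are illustrative assumptions, not the paper's exact objective.

```python
# Sketch of score-level knowledge distillation for query-document relevance:
# the student mimics a (frozen) fine-tuned teacher while also fitting labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Blend a soft loss against teacher scores with a hard-label loss."""
    soft = F.mse_loss(student_logits, teacher_logits)
    hard = F.binary_cross_entropy_with_logits(student_logits, labels.float())
    return alpha * soft + (1.0 - alpha) * hard

# Example with random per-pair relevance logits; in practice the teacher
# would be a fine-tuned BERT cross-encoder and the student a twin encoder.
student_logits = torch.randn(8, requires_grad=True)
teacher_logits = torch.randn(8)
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```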
Related papers
- Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings.
Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs by using a DE structure followed by a lightweight scorer; a MaxSim-style sketch of this idea appears after this list.
arXiv Detail & Related papers (2024-06-25T22:50:48Z)
- Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum of 512 tokens, which makes it non-trivial to apply them in practical settings with long inputs.
We propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows fine-tuning any pretrained model to perform inference on arbitrarily long text.
arXiv Detail & Related papers (2023-10-31T15:41:08Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT designed to eliminate the performance bottlenecks.
Our method yields impressive savings of 56.3x in FLOPs and 31.2x in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- Roof-BERT: Divide Understanding Labour and Join in Work [7.523253052992842]
Roof-BERT is a model with two underlying BERTs and a fusion layer on them.
One of the underlying BERTs encodes the knowledge resources and the other one encodes the original input sentences.
Experimental results on a QA task reveal the effectiveness of the proposed model.
arXiv Detail & Related papers (2021-12-13T15:40:54Z)
- Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has radically improved the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT [24.288824715337483]
ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.
We extensively evaluate ColBERT using two recent passage search datasets.
arXiv Detail & Related papers (2020-04-27T14:21:03Z)
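As noted in the late-interaction entry above, the dual-encoder-plus-lightweight-scorer idea (as in ColBERT) can be sketched with a MaxSim score over per-token embeddings; document token embeddings can still be precomputed offline. Shapes, dimensions, and the name maxsim_score are illustrative assumptions, not the papers' exact implementations.

```python
# Sketch of late-interaction (MaxSim) scoring: each query token takes its
# best-matching document token, and the per-token maxima are summed.
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens, doc_tokens):
    """query_tokens: (q_len, dim); doc_tokens: (d_len, dim); both L2-normalized."""
    sim = query_tokens @ doc_tokens.T          # (q_len, d_len) cosine similarities
    return sim.max(dim=1).values.sum().item()  # best document token per query token

q = F.normalize(torch.randn(8, 128), dim=-1)   # query token embeddings
d = F.normalize(torch.randn(64, 128), dim=-1)  # cached document token embeddings
print(maxsim_score(q, d))
```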