DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling
- URL: http://arxiv.org/abs/2010.03099v1
- Date: Wed, 7 Oct 2020 01:19:23 GMT
- Title: DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling
- Authors: Jiecao Chen, Liu Yang, Karthik Raman, Michael Bendersky, Jung-Jung Yeh, Yun Zhou, Marc Najork, Danyang Cai, Ehsan Emadzadeh
- Abstract summary: We propose DiPair -- a framework for distilling fast and accurate models on text pair tasks.
It is both highly scalable and offers improved quality-speed tradeoffs.
Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach.
- Score: 24.07558669713062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR
applications such as single sentence classification, text pair classification,
and question answering. However, deploying these models in real systems is
highly non-trivial due to their exorbitant computational costs. A common remedy
to this is knowledge distillation (Hinton et al., 2015), leading to faster
inference. However -- as we show here -- existing works are not optimized for
dealing with pairs (or tuples) of texts. Consequently, they are either not
scalable or demonstrate subpar performance. In this work, we propose DiPair --
a novel framework for distilling fast and accurate models on text pair tasks.
Coupled with an end-to-end training strategy, DiPair is both highly scalable
and offers improved quality-speed tradeoffs. Empirical studies conducted on
both academic and real-world e-commerce benchmarks demonstrate the efficacy of
the proposed approach with speedups of over 350x and minimal quality drop
relative to the cross-attention teacher BERT model.
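As a concrete illustration of the score-distillation setup the abstract describes, the minimal PyTorch sketch below regresses a lightweight, decomposable student onto scores produced by a cross-attention teacher. This is a simplified stand-in under stated assumptions, not DiPair's actual architecture or API: DualEncoderStudent, the embedding-bag encoders, and distillation_step are hypothetical names introduced here for illustration.

    # A minimal, illustrative score-distillation sketch (PyTorch), not DiPair's
    # actual architecture: the student encodes query and document independently
    # and regresses its pair score onto a precomputed cross-attention teacher score.
    import torch
    import torch.nn as nn

    class DualEncoderStudent(nn.Module):
        """Hypothetical lightweight student: independent text encoders plus a
        small scoring head over the concatenated representations."""
        def __init__(self, vocab_size=30522, dim=128):
            super().__init__()
            self.embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in encoder
            self.head = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

        def forward(self, query_ids, doc_ids):
            q = self.embed(query_ids)   # (batch, dim); computed once per query
            d = self.embed(doc_ids)     # (batch, dim); cacheable offline
            return self.head(torch.cat([q, d], dim=-1)).squeeze(-1)

    def distillation_step(student, optimizer, query_ids, doc_ids, teacher_scores):
        """One training step: MSE between student and teacher pair scores."""
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(student(query_ids, doc_ids), teacher_scores)
        loss.backward()
        optimizer.step()
        return loss.item()

Because such a student never attends across the pair, document representations can be precomputed and cached, which is where the large inference speedups over a full cross-attention model come from.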
Related papers
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present BOOT, a novel technique that overcomes the limitations of prior distillation methods with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Performance-Efficiency Trade-Offs in Adapting Language Models to Text Classification Tasks [4.101451083646731]
We study how different training procedures that adapt LMs to text classification perform, as we vary model and train set size.
Our findings suggest that even though fine-tuning and prompting work well to train large LMs on large train sets, there are more efficient alternatives that can reduce compute or data cost.
arXiv Detail & Related papers (2022-10-21T15:10:09Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- SE3M: A Model for Software Effort Estimation Using Pre-trained Embedding Models [0.8287206589886881]
This paper evaluates the effectiveness of pre-trained embedding models.
Generic pre-trained models for both approaches went through a fine-tuning process.
Results were very promising, showing that pre-trained models can be used to estimate software effort based only on requirements texts.
arXiv Detail & Related papers (2020-06-30T14:15:38Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning that, during inference, allows for an early (and fast) "exit"; a brief illustrative sketch appears after this list.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
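The early-exit idea summarized in the last entry above can be illustrated with a confidence-threshold loop over per-layer classifiers. This is a generic, hypothetical sketch, not that paper's exact method; the names layers, exit_heads, and confidence_threshold are assumptions introduced here.

    # Illustrative confidence-based early exit at inference time, assuming one
    # lightweight classifier ("exit head") per transformer layer.
    import torch

    @torch.no_grad()
    def early_exit_predict(layers, exit_heads, hidden, confidence_threshold=0.9):
        """Run layers one at a time; return as soon as every example in the
        batch is classified with softmax confidence above the threshold."""
        for layer, head in zip(layers, exit_heads):
            hidden = layer(hidden)                              # (batch, seq_len, dim)
            probs = torch.softmax(head(hidden[:, 0]), dim=-1)   # classify [CLS] token
            confidence, prediction = probs.max(dim=-1)
            if confidence.min() >= confidence_threshold:
                return prediction                               # early (fast) exit
        return prediction                                       # fell through to last layer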
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.