BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online
E-Commerce Search
- URL: http://arxiv.org/abs/2010.10442v1
- Date: Tue, 20 Oct 2020 16:56:04 GMT
- Title: BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online
E-Commerce Search
- Authors: Yunjiang Jiang, Yue Shang, Ziyang Liu, Hongwei Shen, Yun Xiao, Wei
Xiong, Sulong Xu, Weipeng Yan and Di Jin
- Abstract summary: Relevance has a significant impact on user experience and business profit for e-commerce search platforms.
We propose a data-driven framework for search relevance prediction by distilling knowledge from BERT and related multi-layer Transformer teacher models.
We present experimental results on both in-house e-commerce search relevance data and a public data set on sentiment analysis from the GLUE benchmark.
- Score: 34.951088875638696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relevance has a significant impact on user experience and business profit for
e-commerce search platforms. In this work, we propose a data-driven framework
for search relevance prediction by distilling knowledge from BERT and related
multi-layer Transformer teacher models into simple feed-forward networks with
a large amount of unlabeled data. The distillation process produces a student
model that recovers more than 97% of the teacher models' test accuracy on new
queries, at a serving cost several orders of magnitude lower (latency 150x lower
than BERT-Base and 15x lower than the most efficient BERT variant, TinyBERT).
Applying temperature rescaling and teacher model stacking further boosts model
accuracy without increasing the student model's complexity.
We present experimental results on both in-house e-commerce search relevance
data and a public sentiment analysis data set from the GLUE benchmark. The
latter takes advantage of another related public data set of much larger
scale, while disregarding its potentially noisy labels. Embedding analysis and
a case study on the in-house data further highlight the strength of the
resulting model. By making the data processing and model training source
code public, we hope the techniques presented here can help reduce the energy
consumption of state-of-the-art Transformer models and also level the playing
field for small organizations lacking access to cutting-edge machine learning
hardware.
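The core recipe described in the abstract (soft labels from one or more frozen Transformer teachers, a temperature on the softmax, and optional stacking of several teachers) can be sketched in a few lines of PyTorch. The snippet below is an illustration only: the precomputed query/item features, the layer sizes, and the averaged-teacher form of "stacking" are assumptions, not the authors' released implementation.

```python
# Hedged sketch of distilling Transformer teachers into a small feed-forward
# student with temperature rescaling. Feature extraction, model sizes, and the
# teacher-averaging used to approximate "teacher stacking" are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardStudent(nn.Module):
    """Small MLP student operating on precomputed (query, item) features."""
    def __init__(self, input_dim: int, hidden_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def distillation_loss(student_logits, teacher_logits_list, temperature: float = 2.0):
    """Soft-label KL loss against (optionally stacked) teacher logits.

    teacher_logits_list holds one tensor per frozen teacher; stacking is
    approximated here by averaging their temperature-scaled probabilities.
    """
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * temperature ** 2

# Usage on a batch of unlabeled examples: run the teachers offline, then fit
# the student on their soft predictions.
features = torch.randn(32, 128)                            # assumed precomputed features
teacher_logits = [torch.randn(32, 2), torch.randn(32, 2)]  # two frozen teachers
student = FeedForwardStudent(input_dim=128)
loss = distillation_loss(student(features), teacher_logits, temperature=2.0)
loss.backward()
```

Because the teachers are only needed to label the unlabeled pool offline, the serving path runs just the small feed-forward student, which is where the reported latency savings come from.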
Related papers
- EmbedDistill: A Geometric Knowledge Distillation for Information
Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Directed Acyclic Graph Factorization Machines for CTR Prediction via
Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z) - Which Student is Best? A Comprehensive Knowledge Distillation Exam for
Task-Specific BERT Models [3.303435360096988]
We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provides the best trade-off between performance and computational resources.
arXiv Detail & Related papers (2022-01-03T10:07:13Z) - Data Distillation for Text Classification [7.473576666437028]
Data distillation aims to distill the knowledge from a large training dataset down to a smaller and synthetic one.
We develop a novel data distillation method for text classification.
Remarkably, the distilled data, at only 0.1% of the size of the original text data, achieves approximately 90% of the original performance.
arXiv Detail & Related papers (2021-04-17T04:54:54Z) - Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z) - A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advancements in NLP have shown that transfer learning helps achieve state-of-the-art results on new tasks by fine-tuning pre-trained models instead of training from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly better results than a BERT model on a small dataset, and that these simple models train in much less time than it takes to fine-tune their pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
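For the MiniLM entry above, the core idea (training the student so its last-layer self-attention distributions match the teacher's) can be illustrated with a short, hedged sketch. The shapes, the identical head count, and the plain per-row KL objective are simplifying assumptions for illustration and not the paper's exact formulation.

```python
# Hedged sketch of last-layer self-attention distillation in the MiniLM
# spirit: penalize divergence between teacher and student attention maps.
# Identical head counts and the plain KL objective are assumptions.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between attention distributions.

    Both tensors have shape (batch, heads, seq_len, seq_len) and already
    sum to 1 over the last dimension (softmax of scaled dot products).
    """
    eps = 1e-8
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction="batchmean")

# Example with random "attention maps" standing in for real model outputs.
teacher = F.softmax(torch.randn(4, 12, 16, 16), dim=-1)  # teacher, 12 heads
student = F.softmax(torch.randn(4, 12, 16, 16), dim=-1)  # student, same head count assumed
loss = attention_distillation_loss(student, teacher)
```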