BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online
E-Commerce Search
- URL: http://arxiv.org/abs/2010.10442v1
- Date: Tue, 20 Oct 2020 16:56:04 GMT
- Title: BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online
E-Commerce Search
- Authors: Yunjiang Jiang, Yue Shang, Ziyang Liu, Hongwei Shen, Yun Xiao, Wei
Xiong, Sulong Xu, Weipeng Yan and Di Jin
- Abstract summary: Relevance has a significant impact on user experience and business profit for e-commerce search platforms.
We propose a data-driven framework for search relevance prediction by distilling knowledge from BERT and related multi-layer Transformer teacher models.
We present experimental results on both in-house e-commerce search relevance data and a public data set on sentiment analysis from the GLUE benchmark.
- Score: 34.951088875638696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relevance has a significant impact on user experience and business profit for
e-commerce search platforms. In this work, we propose a data-driven framework
for search relevance prediction by distilling knowledge from BERT and related
multi-layer Transformer teacher models into simple feed-forward networks with
a large amount of unlabeled data. The distillation process produces a student
model that recovers more than 97% of the teacher models' test accuracy on new
queries, at a serving cost several orders of magnitude lower (latency 150x lower
than BERT-Base and 15x lower than the most efficient BERT variant, TinyBERT).
Applying temperature rescaling and teacher model stacking further boosts model
accuracy without increasing the student model's complexity.
We present experimental results on both in-house e-commerce search relevance
data and a public sentiment analysis data set from the GLUE benchmark. The
latter takes advantage of another related public data set of much larger
scale, while disregarding its potentially noisy labels. Embedding analysis and
a case study on the in-house data further highlight the strength of the
resulting model. By making the data processing and model training source
code public, we hope the techniques presented here can help reduce the energy
consumption of state-of-the-art Transformer models and also level the playing
field for small organizations lacking access to cutting-edge machine learning
hardware.
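The core recipe described in the abstract (soft labels from one or more frozen Transformer teachers, a temperature on the softmax, and optional stacking of several teachers) can be sketched in a few lines of PyTorch. The snippet below is an illustration only: the precomputed query/item features, the layer sizes, and the averaged-teacher form of "stacking" are assumptions, not the authors' released implementation.

```python
# Hedged sketch of distilling Transformer teachers into a small feed-forward
# student with temperature rescaling. Feature extraction, model sizes, and the
# teacher-averaging used to approximate "teacher stacking" are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardStudent(nn.Module):
    """Small MLP student operating on precomputed (query, item) features."""
    def __init__(self, input_dim: int, hidden_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def distillation_loss(student_logits, teacher_logits_list, temperature: float = 2.0):
    """Soft-label KL loss against (optionally stacked) teacher logits.

    teacher_logits_list holds one tensor per frozen teacher; stacking is
    approximated here by averaging their temperature-scaled probabilities.
    """
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * temperature ** 2

# Usage on a batch of unlabeled examples: run the teachers offline, then fit
# the student on their soft predictions.
features = torch.randn(32, 128)                            # assumed precomputed features
teacher_logits = [torch.randn(32, 2), torch.randn(32, 2)]  # two frozen teachers
student = FeedForwardStudent(input_dim=128)
loss = distillation_loss(student(features), teacher_logits, temperature=2.0)
loss.backward()
```

Because the teachers are only needed to label the unlabeled pool offline, the serving path runs just the small feed-forward student, which is where the reported latency savings come from.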
Related papers
- EmbedDistill: A Geometric Knowledge Distillation for Information
Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Directed Acyclic Graph Factorization Machines for CTR Prediction via
Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z) - Which Student is Best? A Comprehensive Knowledge Distillation Exam for
Task-Specific BERT Models [3.303435360096988]
We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provides the best trade-off between performance and computational resources.
arXiv Detail & Related papers (2022-01-03T10:07:13Z) - Data Distillation for Text Classification [7.473576666437028]
Data distillation aims to distill the knowledge from a large training dataset down to a smaller and synthetic one.
We develop a novel data distillation method for text classification.
Remarkably, the distilled data, at only 0.1% of the size of the original text data, achieves approximately 90% of the original performance.
arXiv Detail & Related papers (2021-04-17T04:54:54Z) - Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z) - A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advancements in NLP have shown that transfer learning helps achieve state-of-the-art results on new tasks by fine-tuning pre-trained models instead of training from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly better results than a BERT model on a small dataset, and that these simple models train in much less time than it takes to fine-tune their pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
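For the MiniLM entry above, the core idea (training the student so its last-layer self-attention distributions match the teacher's) can be illustrated with a short, hedged sketch. The shapes, the identical head count, and the plain per-row KL objective are simplifying assumptions for illustration and not the paper's exact formulation.

```python
# Hedged sketch of last-layer self-attention distillation in the MiniLM
# spirit: penalize divergence between teacher and student attention maps.
# Identical head counts and the plain KL objective are assumptions.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between attention distributions.

    Both tensors have shape (batch, heads, seq_len, seq_len) and already
    sum to 1 over the last dimension (softmax of scaled dot products).
    """
    eps = 1e-8
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction="batchmean")

# Example with random "attention maps" standing in for real model outputs.
teacher = F.softmax(torch.randn(4, 12, 16, 16), dim=-1)  # teacher, 12 heads
student = F.softmax(torch.randn(4, 12, 16, 16), dim=-1)  # student, same head count assumed
loss = attention_distillation_loss(student, teacher)
```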