ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self
On-the-fly Distillation for Dense Passage Retrieval
- URL: http://arxiv.org/abs/2205.09153v1
- Date: Wed, 18 May 2022 18:05:13 GMT
- Title: ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self
On-the-fly Distillation for Dense Passage Retrieval
- Authors: Yuxiang Lu, Yiding Liu, Jiaxiang Liu, Yunsheng Shi, Zhengjie Huang,
Shikun Feng, Yu Sun, Hao Tian, Hua Wu, Shuaiqiang Wang, Dawei Yin, Haifeng
Wang
- Abstract summary: We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation method that can effectively distill late interaction (i.e., ColBERT) to a vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
- Score: 54.54667085792404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural retrievers based on pre-trained language models (PLMs), such as
dual-encoders, have achieved promising performance on the task of open-domain
question answering (QA). Their effectiveness can be further improved, reaching
new state-of-the-art results, by incorporating cross-architecture knowledge
distillation. However, most existing studies directly apply conventional
distillation methods and fail to account for the particular situation in which
the teacher and student have different structures. In this paper, we propose a
novel distillation method that significantly advances cross-architecture
distillation for dual-encoders. Our method 1) introduces a self on-the-fly
distillation method that can effectively distill late interaction (i.e.,
ColBERT) to a vanilla dual-encoder, and 2) incorporates a cascade distillation
process to further improve the performance with a cross-encoder teacher.
Extensive experiments validate that our proposed solution outperforms strong
baselines and establishes a new state-of-the-art on open-domain QA benchmarks.
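A minimal PyTorch sketch of the cascade, self on-the-fly distillation idea described in the abstract: a ColBERT-style late-interaction head (computed from the same encoders) is distilled into the vanilla dual-encoder scores, and an optional cross-encoder teacher is distilled into the late-interaction head. Tensor shapes, the temperature, and the unweighted loss sum are illustrative assumptions, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def dual_encoder_scores(q_cls, p_cls):
    # q_cls: [B, H] query [CLS] embeddings; p_cls: [B, H] passage [CLS] embeddings.
    # In-batch dot-product scores, shape [B, B].
    return q_cls @ p_cls.T

def late_interaction_scores(q_tok, p_tok):
    # q_tok: [B, Lq, H]; p_tok: [B, Lp, H] token embeddings from the same encoders.
    # ColBERT-style MaxSim over every query/passage pair in the batch, shape [B, B].
    sim = torch.einsum("qih,pjh->qpij", q_tok, p_tok)  # [B, B, Lq, Lp]
    return sim.max(dim=-1).values.sum(dim=-1)          # max over passage tokens, sum over query tokens

def kd_loss(student_scores, teacher_scores, tau=1.0):
    # KL divergence between softmax-normalized in-batch score distributions.
    return F.kl_div(
        F.log_softmax(student_scores / tau, dim=-1),
        F.softmax(teacher_scores.detach() / tau, dim=-1),  # stop-gradient: KD only updates the student side
        reduction="batchmean",
    )

def cascade_distillation_loss(q_cls, p_cls, q_tok, p_tok, ce_scores=None):
    # ce_scores (optional): [B, B] relevance scores from a cross-encoder teacher.
    labels = torch.arange(q_cls.size(0), device=q_cls.device)
    de = dual_encoder_scores(q_cls, p_cls)       # vanilla dual-encoder (final student)
    li = late_interaction_scores(q_tok, p_tok)   # late-interaction head (on-the-fly teacher)

    loss = F.cross_entropy(de, labels) + F.cross_entropy(li, labels)  # in-batch contrastive losses
    loss = loss + kd_loss(de, li)                # self on-the-fly: late interaction -> dual-encoder
    if ce_scores is not None:
        loss = loss + kd_loss(li, ce_scores)     # cascade: cross-encoder -> late interaction
    return loss
```
The detach() on the teacher side keeps each distillation term one-directional: the late-interaction head learns only from its own contrastive loss and the cross-encoder, while the dual-encoder additionally learns from the late-interaction score distribution.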
Related papers
- How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? [99.87554379608224]
The cross-modal similarity score distribution of a cross-encoder is more concentrated, while that of a dual-encoder is nearly normal.
Only the relative order between hard negatives conveys valid knowledge, while the order information between easy negatives has little significance.
We propose a novel Contrastive Partial Ranking Distillation (DCPR) method, which mimics the relative order between hard negative samples via contrastive learning.
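A minimal sketch of the hard-negative ordering objective described above; the pairwise hinge form, the margin, and top-k hard-negative selection are assumptions for illustration, not the paper's exact recipe.
```python
import torch
import torch.nn.functional as F

def partial_ranking_distillation(student_scores, teacher_scores, k_hard=8, margin=0.05):
    # student_scores / teacher_scores: [B, N] similarity scores for N candidates per query.
    # 1) Take the teacher's top-k candidates as the hard set; easy negatives are ignored.
    hard_idx = teacher_scores.topk(k_hard, dim=-1).indices   # [B, k]
    s = student_scores.gather(-1, hard_idx)                  # [B, k]
    t = teacher_scores.gather(-1, hard_idx)                  # [B, k]
    # 2) For every pair (i, j) in the hard set where the teacher ranks i above j,
    #    push the student's score difference above a margin (hinge on violations).
    s_diff = s.unsqueeze(-1) - s.unsqueeze(-2)               # [B, k, k], s[i] - s[j]
    teacher_prefers = (t.unsqueeze(-1) - t.unsqueeze(-2)) > 0
    violations = F.relu(margin - s_diff)[teacher_prefers]
    return violations.mean() if violations.numel() > 0 else s.sum() * 0.0
```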
arXiv Detail & Related papers (2024-07-10T09:10:01Z)
- Dual Knowledge Distillation for Efficient Sound Event Detection [20.236008919003083]
Sound event detection (SED) is essential for recognizing specific sounds and their temporal locations within acoustic signals.
We introduce a novel framework referred to as dual knowledge distillation for developing efficient SED systems.
arXiv Detail & Related papers (2024-02-05T07:30:32Z)
- Query Encoder Distillation via Embedding Alignment is a Strong Baseline Method to Boost Dense Retriever Online Efficiency [4.254906060165999]
We show that even a 2-layer, BERT-based query encoder can still retain 92.5% of the full DE performance on the BEIR benchmark.
We hope that our findings will encourage the community to re-evaluate the trade-offs between method complexity and performance improvements.
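A hedged sketch of the embedding-alignment distillation named in this entry's title: a truncated query encoder is trained to reproduce the full query encoder's embeddings, so an existing passage index can be kept as-is. The 2-layer truncation trick, the MSE loss, and the model names are illustrative assumptions, not the paper's exact setup.
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
teacher = AutoModel.from_pretrained("bert-base-uncased").eval()    # full query encoder (frozen)
student = AutoModel.from_pretrained("bert-base-uncased")
student.encoder.layer = student.encoder.layer[:2]                  # keep only the first 2 layers

def cls_embed(model, queries):
    batch = tok(queries, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]                  # [CLS] query embedding

def alignment_loss(queries):
    with torch.no_grad():
        target = cls_embed(teacher, queries)                       # teacher query embeddings
    return F.mse_loss(cls_embed(student, queries), target)         # align the student to them
```
Because only the query side changes, the pre-built passage embeddings and index stay valid; only online query encoding becomes cheaper.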
arXiv Detail & Related papers (2023-06-05T06:53:55Z)
- DisWOT: Student Architecture Search for Distillation WithOut Training [0.0]
We explore a novel training-free framework to search for the best student architectures for a given teacher.
Our work first empirically shows that the optimal model under vanilla training cannot be the winner in distillation.
Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces.
arXiv Detail & Related papers (2023-03-28T01:58:45Z)
- DETRDistill: A Universal Knowledge Distillation Framework for DETR-families [11.9748352746424]
Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations.
Knowledge distillation (KD) can be employed to compress the huge model by constructing a universal teacher-student learning framework.
arXiv Detail & Related papers (2022-11-17T13:35:11Z)
- Aligning Logits Generatively for Principled Black-Box Knowledge Distillation [49.43567344782207]
Black-Box Knowledge Distillation (B2KD) is the problem of cloud-to-edge model compression when the data and models hosted on the server are invisible.
We formalize a two-step workflow consisting of deprivatization and distillation.
We propose a new method Mapping-Emulation KD (MEKD) that distills a black-box cumbersome model into a lightweight one.
arXiv Detail & Related papers (2022-05-21T02:38:16Z)
- Localization Distillation for Object Detection [134.12664548771534]
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the classification logits.
We present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student.
We show that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has underperformed for years.
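A hedged sketch of the localization-distillation idea: box edges are represented as discrete probability distributions over offset bins, and the student's distributions are matched to the teacher's with a temperature-scaled KL divergence. The binning, temperature, and tensor layout are assumptions, not the paper's exact design.
```python
import torch
import torch.nn.functional as F

def localization_distillation_loss(student_reg_logits, teacher_reg_logits, tau=10.0):
    # *_reg_logits: [N, 4, n_bins] logits over discretized offsets for the four box edges
    # of N matched boxes; each edge is treated as a probability distribution over bins.
    s = F.log_softmax(student_reg_logits / tau, dim=-1)
    t = F.softmax(teacher_reg_logits.detach() / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (tau ** 2)
```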
arXiv Detail & Related papers (2022-04-12T17:14:34Z)
- Delta Distillation for Efficient Video Processing [68.81730245303591]
We propose a novel knowledge distillation scheme, coined Delta Distillation.
We demonstrate that these temporal variations can be effectively distilled due to the temporal redundancies within video frames.
As a by-product, delta distillation improves the temporal consistency of the teacher model.
arXiv Detail & Related papers (2022-03-17T20:13:30Z)
- Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
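A generic, DARTS-style sketch of searching over feature aggregations as distillation targets, in the spirit of the entry above; the candidate set, the softmax relaxation, and all names are assumptions rather than DFA's exact formulation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregationSearch(nn.Module):
    """Softmax-relaxed choice of which teacher feature maps to aggregate as a KD target."""

    def __init__(self, num_candidates, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_candidates))    # architecture weights
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # align student features

    def forward(self, teacher_feats, student_feat):
        # teacher_feats: list of [B, C, H, W] maps (same shape for simplicity); student_feat: [B, C, H, W].
        w = F.softmax(self.alpha, dim=0)
        target = sum(wi * f.detach() for wi, f in zip(w, teacher_feats))
        return F.mse_loss(self.proj(student_feat), target)
```
In a two-stage, DARTS-like setup, `alpha` would be optimized on held-out data first and the resulting aggregation then fixed for the final distillation run.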
arXiv Detail & Related papers (2020-08-02T15:42:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.