Distilling Dense Representations for Ranking using Tightly-Coupled
Teachers
- URL: http://arxiv.org/abs/2010.11386v1
- Date: Thu, 22 Oct 2020 02:26:01 GMT
- Title: Distilling Dense Representations for Ranking using Tightly-Coupled
Teachers
- Authors: Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin
- Abstract summary: We apply knowledge distillation to improve the recently proposed late-interaction ColBERT model.
We distill the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product.
We empirically show that our approach improves query latency and greatly reduces the onerous storage requirements of ColBERT.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an approach to ranking with dense representations that applies
knowledge distillation to improve the recently proposed late-interaction
ColBERT model. Specifically, we distill the knowledge from ColBERT's expressive
MaxSim operator for computing relevance scores into a simple dot product, thus
enabling single-step ANN search. Our key insight is that during distillation,
tight coupling between the teacher model and the student model enables more
flexible distillation strategies and yields better learned representations. We
empirically show that our approach improves query latency and greatly reduces
the onerous storage requirements of ColBERT, while only making modest
sacrifices in terms of effectiveness. By combining our dense representations
with sparse representations derived from document expansion, we are able to
approach the effectiveness of a standard cross-encoder reranker using BERT that
is orders of magnitude slower.
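The core contrast in the abstract, between ColBERT's late-interaction MaxSim scoring and the single dot product it is distilled into, can be sketched as follows. This is a minimal illustration of the two scoring functions, not the paper's implementation; shapes and function names are assumptions for the sketch.

```python
import numpy as np

def maxsim_score(query_embs, doc_embs):
    """ColBERT-style late interaction: for each query token embedding,
    take the maximum similarity over all document token embeddings,
    then sum those maxima over the query tokens."""
    # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim)
    sim = query_embs @ doc_embs.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

def dot_product_score(query_vec, doc_vec):
    """Single-vector scoring: one dot product per query-document pair,
    which is what makes single-step ANN search possible after distillation."""
    # query_vec, doc_vec: (dim,)
    return float(query_vec @ doc_vec)
```

MaxSim requires comparing every query token against every document token, so each document must store one vector per token; the distilled dot product needs only one vector per document, which is the source of the latency and storage savings the abstract claims.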
Related papers
- Faithful Label-free Knowledge Distillation [8.572967695281054]
This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM)
It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
arXiv Detail & Related papers (2024-11-22T01:48:44Z)
- Learning to Maximize Mutual Information for Chain-of-Thought Distillation [13.660167848386806]
Distilling Step-by-Step (DSS) has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts.
However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction.
We propose a variational approach to solve this problem using a learning-based method.
arXiv Detail & Related papers (2024-03-05T22:21:45Z)
- Distillation from Heterogeneous Models for Top-K Recommendation [43.83625440616829]
HetComp is a framework that guides the student model by transferring sequences of knowledge from teachers' trajectories.
HetComp significantly improves the distillation quality and the generalization of the student model.
arXiv Detail & Related papers (2023-03-02T10:23:50Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR)
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine-grained face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z)
- RefBERT: Compressing BERT by Referencing to Pre-computed Representations [19.807272592342148]
RefBERT beats vanilla TinyBERT by over 8.1% and achieves more than 94% of the performance of BERT-Base on the GLUE benchmark.
RefBERT is 7.4x smaller and 9.5x faster at inference than BERT-Base.
arXiv Detail & Related papers (2021-06-11T01:22:08Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD yields a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
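MixKD's name points at mixup-style interpolation applied to distillation. The summary above does not spell out the mechanism, so the following is only an illustrative sketch under the usual mixup assumptions: interpolate two examples and, correspondingly, the teacher's logits, giving the student extra synthetic (input, soft-target) pairs. All names and shapes here are hypothetical.

```python
import numpy as np

def mixup_pair(x_i, x_j, teacher_logits_i, teacher_logits_j, alpha=0.4, rng=None):
    """Mixup-style interpolation for distillation: blend two inputs and the
    teacher's logits with the same coefficient, sampled from Beta(alpha, alpha).
    Returns the synthetic input, its soft target, and the mixing coefficient."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing coefficient in (0, 1)
    x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * teacher_logits_i + (1 - lam) * teacher_logits_j
    return x_mix, y_mix, lam
```

The student is then trained to match `y_mix` on `x_mix`, which is what lets a data-agnostic framework squeeze more supervision out of a limited-data setting.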
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
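The "distinguish a positive sample from a large set of negatives" objective described above is the standard contrastive (InfoNCE-style) setup. The sketch below shows that shape of loss applied to intermediate representations; it is an assumption-laden illustration in the spirit of CoDIR, not the paper's exact objective, and every name and shape is hypothetical.

```python
import numpy as np

def contrastive_distill_loss(student_h, teacher_h_pos, teacher_h_negs, temperature=0.1):
    """InfoNCE-style loss on intermediate representations: the student's
    hidden state should be more similar to its teacher counterpart
    (positive) than to teacher hidden states of other samples (negatives)."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(cos(student_h, teacher_h_pos) / temperature)
    negs = sum(np.exp(cos(student_h, n) / temperature) for n in teacher_h_negs)
    return float(-np.log(pos / (pos + negs)))
```

The loss approaches zero when the student's representation aligns with the positive teacher state and diverges when it aligns with a negative instead, which is what pushes the student to absorb the structure of the teacher's hidden layers.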
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.