Distillation versus Contrastive Learning: How to Train Your Rerankers
- URL: http://arxiv.org/abs/2507.08336v2
- Date: Fri, 01 Aug 2025 22:33:29 GMT
- Title: Distillation versus Contrastive Learning: How to Train Your Rerankers
- Authors: Zhichao Xu, Zhiqi Huang, Shengyao Zhuang, Vivek Srikumar
- Abstract summary: Two strategies are widely used to train text rerankers: contrastive learning and knowledge distillation. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model.
- Score: 37.43565487845178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training effective text rerankers is crucial for information retrieval. Two strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied extensively, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. We recommend using knowledge distillation to train smaller rerankers if a larger, more powerful teacher is accessible; in its absence, contrastive learning remains a robust baseline.
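To make the two training strategies concrete, below is a minimal PyTorch-style sketch of the loss functions typically used for each when training a cross-encoder reranker that scores (query, passage) pairs: a contrastive objective (cross-entropy over one labeled positive and sampled negatives) and a distillation objective (KL divergence against a teacher reranker's score distribution). The tensor shapes, names, and temperature handling are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(student_scores: torch.Tensor) -> torch.Tensor:
    """Contrastive learning on ground-truth labels.

    student_scores: [batch, 1 + num_negatives] reranker scores per query,
    where column 0 is the labeled positive passage.
    """
    targets = torch.zeros(student_scores.size(0), dtype=torch.long,
                          device=student_scores.device)
    return F.cross_entropy(student_scores, targets)


def distillation_loss(student_scores: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Knowledge distillation from a (typically larger) teacher reranker.

    Both tensors are [batch, num_candidates]; the student matches the
    teacher's score distribution over candidate passages via KL divergence.
    """
    log_p_student = F.log_softmax(student_scores / temperature, dim=-1)
    p_teacher = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2
```

Either loss would plug into an otherwise standard training loop; the comparison above concerns which signal the student is trained to match, ground-truth labels versus a larger teacher's scores.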
Related papers
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced to filter and weight teacher representations so that distillation draws only on task-relevant representations. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - ReffAKD: Resource-efficient Autoencoder-based Knowledge Distillation [3.301728339780329]
We propose a method to boost knowledge distillation efficiency without the need for resource-heavy teacher models.
Our approach efficiently generates soft labels, thereby eliminating the need for a large teacher model.
Our experiments on various datasets, including CIFAR-100, Tiny ImageNet, and Fashion-MNIST, demonstrate the superior resource efficiency of our approach.
arXiv Detail & Related papers (2024-04-15T15:54:30Z) - Cooperative Knowledge Distillation: A Learner Agnostic Approach [15.414204257189596]
We formulate a novel form of knowledge distillation in which many models can act as both students and teachers.
Because different models may have different strengths and weaknesses, all models can act as either students or teachers.
arXiv Detail & Related papers (2024-02-02T17:31:50Z) - Teaching What You Should Teach: A Data-Based Distillation Method [20.595460553747163]
We introduce the "Teaching what you Should Teach" strategy into a knowledge distillation framework.
We propose a data-based distillation method named "TST" that searches for desirable augmented samples to assist in distilling more efficiently and rationally.
To be specific, we design a neural network-based data augmentation module with prior bias, which helps find augmented samples that the teacher handles well but the student does not.
arXiv Detail & Related papers (2022-12-11T06:22:14Z) - ARDIR: Improving Robustness using Knowledge Distillation of Internal Representation [2.0875529088206553]
We propose Adversarial Robust Distillation with Internal Representation (ARDIR) to utilize knowledge distillation even more effectively.
ARDIR uses the internal representation of the teacher model as a label for adversarial training.
We show that ARDIR outperforms previous methods in our experiments.
arXiv Detail & Related papers (2022-11-01T03:11:59Z) - Efficient Knowledge Distillation from Model Checkpoints [36.329429655242535]
We show that a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained and fully converged models.
We propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information.
arXiv Detail & Related papers (2022-10-12T17:55:30Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching [14.666392130118307]
Most studies manually tie intermediate features of the teacher and student, and transfer knowledge through pre-defined links.
We introduce an effective and efficient feature distillation method utilizing all the feature levels of the teacher without manually selecting the links.
arXiv Detail & Related papers (2021-02-05T03:07:57Z) - Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z) - Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
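As context for the global-transfer objective described in the WCoRD entry above, here is a minimal sketch of an InfoNCE-style contrastive critic between teacher and student features, which maximizes a lower bound on their mutual information. The linear projections, dimensions, and temperature are illustrative assumptions; this is not the paper's exact Wasserstein dual formulation, only the general contrastive mechanism it builds on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveCritic(nn.Module):
    """Projects teacher and student features and scores all pairs in a batch."""

    def __init__(self, s_dim: int, t_dim: int, proj_dim: int = 128):
        super().__init__()
        self.proj_s = nn.Linear(s_dim, proj_dim)
        self.proj_t = nn.Linear(t_dim, proj_dim)

    def forward(self, s_feat: torch.Tensor, t_feat: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
        # Normalize projected features and score every student-teacher pair.
        s = F.normalize(self.proj_s(s_feat), dim=-1)
        t = F.normalize(self.proj_t(t_feat), dim=-1)
        logits = s @ t.t() / temperature                    # [batch, batch]
        labels = torch.arange(s.size(0), device=s.device)   # matched pairs are positives
        # Minimizing this cross-entropy maximizes an InfoNCE lower bound on the
        # mutual information between teacher and student representations.
        return F.cross_entropy(logits, labels)
```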