DistillCSE: Distilled Contrastive Learning for Sentence Embeddings
- URL: http://arxiv.org/abs/2310.13499v2
- Date: Mon, 30 Oct 2023 09:23:36 GMT
- Title: DistillCSE: Distilled Contrastive Learning for Sentence Embeddings
- Authors: Jiahao Xu and Wei Shao and Lihui Chen and Lemao Liu
- Abstract summary: This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation.
The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation.
The paper proposes two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization and averaging the logits from multiple teacher components.
- Score: 32.6620719893457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes the DistillCSE framework, which performs contrastive
learning under the self-training paradigm with knowledge distillation. The
potential advantage of DistillCSE is its self-enhancing feature: using a base
model to provide additional supervision signals, a stronger model may be
learned through knowledge distillation. However, the vanilla DistillCSE with the
standard implementation of knowledge distillation achieves only marginal
improvements due to severe overfitting. Further quantitative analyses show the
reason: because of the nature of contrastive learning, the teacher model's
logits exhibit a relatively large variance under standard knowledge
distillation. To mitigate the issue induced by this high variance, the paper
proposes two simple yet effective solutions for knowledge distillation: a
Group-P shuffling strategy as an implicit regularization and averaging the
logits from multiple teacher components. Experiments on standard
benchmarks demonstrate that the proposed DistillCSE outperforms many strong
baseline methods and yields a new state-of-the-art performance.
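To make the recipe concrete, here is a minimal sketch (not the authors' released code) of distilled contrastive learning for sentence embeddings as the abstract describes it: the student is trained with the usual in-batch InfoNCE loss plus a KL term that matches its similarity logits to logits averaged over several teacher components, the variance-reduction remedy mentioned above. The function names, the temperature, and the weight alpha are illustrative assumptions, and the Group-P shuffling regularizer is omitted.

```python
# Sketch of distilled contrastive learning with teacher-logit averaging.
# All names and hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def similarity_logits(a, b, temperature):
    """Cosine-similarity logits between two batches of sentence embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.T / temperature

def distilled_contrastive_loss(student_z1, student_z2, teacher_pairs,
                               temperature=0.05, alpha=1.0):
    """student_z1, student_z2: (batch, dim) embeddings of the same sentences from
    two dropout-augmented passes of the student (SimCSE-style).
    teacher_pairs: list of (z1, z2) embedding pairs from several teacher components."""
    labels = torch.arange(student_z1.size(0), device=student_z1.device)

    # Standard in-batch contrastive (InfoNCE) loss on the student's own logits.
    student_logits = similarity_logits(student_z1, student_z2, temperature)
    contrastive = F.cross_entropy(student_logits, labels)

    # Distillation term: average the logits from multiple teacher components to
    # reduce their variance, then pull the student's distribution towards them.
    with torch.no_grad():
        teacher_logits = torch.stack(
            [similarity_logits(t1, t2, temperature) for t1, t2 in teacher_pairs]
        ).mean(dim=0)
    distill = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return contrastive + alpha * distill
```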
Related papers
- MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models [4.404914701832396]
We propose MLKD-BERT, a novel knowledge distillation method that distills multi-level knowledge in the teacher-student framework.
Our method outperforms state-of-the-art knowledge distillation methods on BERT.
In addition, MLKD-BERT can flexibly set the number of student attention heads, allowing for a substantial decrease in inference time with little performance drop.
arXiv Detail & Related papers (2024-07-03T03:03:30Z) - The Staged Knowledge Distillation in Video Classification: Harmonizing Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Class-aware Information for Logit-based Knowledge Distillation [16.634819319915923]
We propose a Class-aware Logit Knowledge Distillation (CLKD) method that extends logit distillation to both the instance level and the class level.
CLKD enables the student model to mimic higher-level semantic information from the teacher model, thereby improving distillation performance.
arXiv Detail & Related papers (2022-11-27T09:27:50Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - DETRDistill: A Universal Knowledge Distillation Framework for DETR-families [11.9748352746424]
Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations.
Knowledge distillation (KD) can be employed to compress the huge model by constructing a universal teacher-student learning framework.
arXiv Detail & Related papers (2022-11-17T13:35:11Z) - A Novel Self-Knowledge Distillation Approach with Siamese Representation Learning for Action Recognition [6.554259611868312]
Self-knowledge distillation is an effective transfer of knowledge from a heavy network (teacher) to a small network (student) to boost the student's performance.
This paper introduces a novel Self-knowledge distillation approach via Siamese representation learning.
arXiv Detail & Related papers (2022-09-03T01:56:58Z) - Mind the Gap in Distilling StyleGANs [100.58444291751015]
The StyleGAN family is among the most popular Generative Adversarial Networks (GANs) for unconditional generation.
This paper provides a comprehensive study of distilling from the popular StyleGAN-like architecture.
arXiv Detail & Related papers (2022-08-18T14:18:29Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), while A aids this process by learning the residual error between them (a brief illustrative sketch follows this list).
Experiments show that our approach achieves appealing results on the popular classification datasets CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
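As an aside, the residual distillation idea summarized in the last entry can be written down in a few lines. The module names and loss terms below are assumptions for illustration, not the RKD authors' implementation.

```python
# Illustrative sketch of residual knowledge distillation as summarized above:
# the student S mimics the teacher T's feature maps, while an assistant A
# learns the residual error between them. Names and losses are assumptions.
import torch
import torch.nn.functional as F

def residual_distillation_losses(student_feat, teacher_feat, assistant):
    """student_feat, teacher_feat: (batch, dim) features from S and T;
    assistant: a small module predicting a residual correction from S's features."""
    # S is trained to mimic the teacher's feature maps directly.
    mimic_loss = F.mse_loss(student_feat, teacher_feat)
    # A learns the residual error between the student's and teacher's features
    # (the student is detached so this term updates only the assistant).
    predicted_residual = assistant(student_feat.detach())
    residual_loss = F.mse_loss(predicted_residual,
                               teacher_feat - student_feat.detach())
    return mimic_loss, residual_loss
```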
This list is automatically generated from the titles and abstracts of the papers in this site.