DisCo: Effective Knowledge Distillation For Contrastive Learning of
Sentence Embeddings
- URL: http://arxiv.org/abs/2112.05638v1
- Date: Fri, 10 Dec 2021 16:11:23 GMT
- Title: DisCo: Effective Knowledge Distillation For Contrastive Learning of
Sentence Embeddings
- Authors: Xing Wu, Chaochen Gao, Jue Wang, Liangjun Zang, Zhongyuan Wang,
Songlin Hu
- Abstract summary: We propose an enhanced knowledge distillation framework termed Distill-Contrast (DisCo).
DisCo transfers the capability of a large sentence embedding model to a small student model on large unlabelled data.
We also propose Contrastive Knowledge Distillation (CKD) to enhance the consistency among teacher model training, KD, and student model finetuning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has proven well suited to learning sentence
embeddings and can significantly improve performance on semantic textual
similarity (STS) tasks. Recently, large contrastive learning models, e.g.,
Sentence-T5, have been proposed to learn more powerful sentence embeddings.
Though effective, such large models are hard to serve online due to limits on
computational resources or time cost. To tackle this, knowledge distillation
(KD) is commonly adopted: it can compress a large "teacher" model into a small
"student" model, but generally incurs some performance loss. Here we propose an
enhanced KD framework termed Distill-Contrast (DisCo). DisCo first uses KD to
transfer the capability of a large sentence embedding model to a small student
model on large unlabelled data, and then finetunes the student model with
contrastive learning on labelled training data. For the KD step, we further
propose Contrastive Knowledge Distillation (CKD) to enhance the consistency
among teacher model training, KD, and student model finetuning, which, similar
to prompt learning, can further improve performance. Extensive experiments on 7
STS benchmarks show that student models trained with DisCo and CKD suffer
little or even no performance loss and consistently outperform counterparts of
the same parameter size. Remarkably, our 110M-parameter student model even
outperforms the latest state-of-the-art (SOTA) model, Sentence-T5 (11B), with
only 1% of its parameters.
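The abstract does not spell out the CKD objective, but a minimal sketch consistent with the description above might look as follows; the function name, temperature, and the InfoNCE-style cross-model formulation are assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_emb, teacher_emb, temperature=0.05):
    """Hypothetical CKD objective: the student's embedding of sentence i is
    the query, the frozen teacher's embedding of the same sentence is the
    positive, and the teacher's embeddings of the other in-batch sentences
    are negatives. Reusing this InfoNCE form for later contrastive
    finetuning would keep the two stages consistent."""
    s = F.normalize(student_emb, dim=-1)           # (batch, dim)
    t = F.normalize(teacher_emb, dim=-1).detach()  # (batch, dim), no teacher grads
    logits = s @ t.T / temperature                 # cross-model similarity matrix
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)
```

In a DisCo-style pipeline, a loss of this shape would drive the unlabelled-data distillation stage, after which the student is finetuned with an ordinary contrastive objective on labelled pairs.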
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
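A rough sketch of the interleaved sampling loop the SKD entry above describes, assuming Hugging-Face-style causal LMs that expose `.logits`; the greedy student proposal and the top-k acceptance rule are illustrative choices, not the paper's exact criterion.

```python
import torch

@torch.no_grad()
def skd_generate(student, teacher, input_ids, max_new_tokens=32, top_k=25):
    """Student proposes each next token; if the proposal falls outside the
    teacher's top-k, it is replaced by a token sampled from the teacher's
    own distribution, yielding on-the-fly training data."""
    ids = input_ids
    for _ in range(max_new_tokens):
        s_logits = student(ids).logits[:, -1, :]
        proposal = s_logits.argmax(dim=-1, keepdim=True)        # (batch, 1)
        t_logits = teacher(ids).logits[:, -1, :]
        in_top_k = (t_logits.topk(top_k, dim=-1).indices == proposal).any(-1, keepdim=True)
        fallback = torch.multinomial(t_logits.softmax(dim=-1), num_samples=1)
        ids = torch.cat([ids, torch.where(in_top_k, proposal, fallback)], dim=-1)
    return ids
```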
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
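The OKD entry above only names the mechanism, but one way to picture a small "online module" is a trainable bottleneck adapter attached to an otherwise frozen teacher and updated jointly with the student; everything below (names, bottleneck size, residual form) is an illustrative guess rather than that paper's design.

```python
import torch
import torch.nn as nn

class OnlineModule(nn.Module):
    """Hypothetical small module inside a frozen teacher: a residual
    bottleneck adapter whose parameters train alongside the student,
    letting the teacher's targets adapt toward the student's capacity."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden):
        return hidden + self.up(torch.relu(self.down(hidden)))
```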
- Relational Representation Distillation [6.24302896438145]
We introduce Relational Representation Distillation (RRD) to explore and reinforce relationships between teacher and student models.
Inspired by self-supervised learning principles, it uses a relaxed contrastive loss that focuses on similarity rather than exact replication (see the sketch after this entry).
Our approach demonstrates superior performance on CIFAR-100 and ImageNet ILSVRC-2012 and sometimes even outperforms the teacher network when combined with KD.
arXiv Detail & Related papers (2024-07-16T14:56:13Z)
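A minimal sketch of a relaxed, relation-focused loss in the spirit of the RRD entry above: rather than copying teacher features, the student matches the teacher's distribution of pairwise in-batch similarities. The temperatures and the KL formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def relational_loss(student_feats, teacher_feats, tau_s=0.1, tau_t=0.05):
    """Match each sample's similarity distribution over the rest of the
    batch (teacher vs. student), enforcing similar relational structure
    without demanding exact feature replication."""
    n = student_feats.size(0)
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1).detach()
    off_diag = ~torch.eye(n, dtype=torch.bool, device=s.device)
    sim_s = (s @ s.T)[off_diag].view(n, n - 1) / tau_s
    sim_t = (t @ t.T)[off_diag].view(n, n - 1) / tau_t
    return F.kl_div(F.log_softmax(sim_s, dim=-1),
                    F.softmax(sim_t, dim=-1), reduction="batchmean")
```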
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss (sketched after this entry), where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency of utilizing student-generated outputs.
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
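The skew Kullback-Leibler divergence named in the DistiLLM entry above is commonly defined as KL(p || alpha*p + (1-alpha)*q); a small sketch follows, with alpha = 0.1 purely illustrative.

```python
import torch
import torch.nn.functional as F

def skew_kl(teacher_probs, student_probs, alpha=0.1, eps=1e-8):
    """Skew KL divergence KL(p || alpha*p + (1-alpha)*q): mixing a little
    of the teacher's mass into the student's distribution keeps the
    divergence finite wherever the student assigns (near-)zero probability."""
    mixture = alpha * teacher_probs + (1 - alpha) * student_probs
    return F.kl_div((mixture + eps).log(), teacher_probs, reduction="batchmean")
```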
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
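One plausible reading of the "nuanced differences in a teacher model's interpretations of samples" from the entry above is aligning pairwise representation differences across the two models; the sketch below follows that reading and is not the paper's confirmed loss.

```python
import torch
import torch.nn.functional as F

def comparative_loss(student_feats, teacher_feats):
    """Align the student's sample-to-sample representation differences with
    the teacher's, so the student learns how the teacher tells samples
    apart rather than copying each embedding in isolation."""
    d_s = student_feats.unsqueeze(1) - student_feats.unsqueeze(0)  # (n, n, dim)
    d_t = teacher_feats.unsqueeze(1) - teacher_feats.unsqueeze(0)
    return F.mse_loss(d_s, d_t.detach())
```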
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
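A sketch of a mixup-style distillation step in the spirit of the MixKD entry above; mixing at the embedding level and the Beta(0.4, 0.4) coefficient are illustrative assumptions, not necessarily the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def mixkd_step(student, teacher, emb_a, emb_b, alpha=0.4):
    """Interpolate two inputs' embeddings (mixup) and train the student to
    match the teacher's output distribution on the mixed input, cheaply
    enlarging the effective distillation set. Both models are assumed to
    map embeddings to logits."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    with torch.no_grad():
        target = teacher(mixed).softmax(dim=-1)
    return F.kl_div(student(mixed).log_softmax(dim=-1), target,
                    reduction="batchmean")
```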
- Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT [20.732095457775138]
Knowledge Distillation (KD) is one of the widely known methods for model compression.
Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP).
arXiv Detail & Related papers (2020-09-30T17:52:15Z)