Contrastive Distillation on Intermediate Representations for Language
Model Compression
- URL: http://arxiv.org/abs/2009.14167v1
- Date: Tue, 29 Sep 2020 17:31:43 GMT
- Title: Contrastive Distillation on Intermediate Representations for Language
Model Compression
- Authors: Siqi Sun, Zhe Gan, Yu Cheng, Yuwei Fang, Shuohang Wang, Jingjing Liu
- Abstract summary: We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
- Score: 89.31786191358802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing language model compression methods mostly use a simple L2 loss to
distill knowledge in the intermediate representations of a large BERT model to
a smaller one. Although widely used, this objective by design assumes that all
the dimensions of hidden representations are independent, failing to capture
important structural knowledge in the intermediate layers of the teacher
network. To achieve better distillation efficacy, we propose Contrastive
Distillation on Intermediate Representations (CoDIR), a principled knowledge
distillation framework where the student is trained to distill knowledge
through intermediate layers of the teacher via a contrastive objective. By
learning to distinguish a positive sample from a large set of negative samples,
CoDIR facilitates the student's exploitation of the rich information in the
teacher's hidden layers. CoDIR can be readily applied to compress large-scale language
models in both pre-training and finetuning stages, and achieves superb
performance on the GLUE benchmark, outperforming state-of-the-art compression
methods.
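The core of CoDIR is an InfoNCE-style contrastive objective on pooled intermediate representations: the student's features for an input should match the teacher's features for the same input more closely than teacher features of other samples (e.g., from a memory bank). The sketch below is a minimal illustration of that idea only; the function name, pooling, and negative-sampling details are assumptions, not the authors' implementation.

```python
# Minimal sketch of a contrastive distillation loss over intermediate
# representations, in the spirit of CoDIR. NOT the authors' code; names,
# temperature, and the memory-bank negatives are illustrative assumptions.
import torch
import torch.nn.functional as F

def codir_loss(student_hidden, teacher_hidden, teacher_negatives, temperature=0.1):
    """InfoNCE-style loss: the student's pooled intermediate representation
    should be closer to its own teacher representation (positive) than to
    teacher representations of other examples (negatives).

    student_hidden:    (batch, dim)   pooled student intermediate features
    teacher_hidden:    (batch, dim)   pooled teacher features for the same inputs
    teacher_negatives: (num_neg, dim) teacher features of other samples
    """
    s = F.normalize(student_hidden, dim=-1)
    t_pos = F.normalize(teacher_hidden, dim=-1)
    t_neg = F.normalize(teacher_negatives, dim=-1)

    pos_logits = (s * t_pos).sum(dim=-1, keepdim=True) / temperature  # (batch, 1)
    neg_logits = s @ t_neg.t() / temperature                          # (batch, num_neg)
    logits = torch.cat([pos_logits, neg_logits], dim=1)

    # The positive sits at index 0; cross-entropy gives the InfoNCE objective.
    labels = torch.zeros(s.size(0), dtype=torch.long, device=s.device)
    return F.cross_entropy(logits, labels)

# Usage: combined with the usual task and logit-distillation losses.
student_h = torch.randn(8, 768)
teacher_h = torch.randn(8, 768)
memory_bank = torch.randn(1024, 768)
loss = codir_loss(student_h, teacher_h, memory_bank)
```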
Related papers
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
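The token-level objective above is a clipped variant of the familiar KL-divergence distillation loss. Since the summary does not give the exact clipping rule, the sketch below shows only the standard token-level KL baseline that such a variant would modify; the function name, temperature, and masking scheme are assumptions.

```python
# Generic token-level KL distillation for language models; the paper's
# distribution-adaptive clipping would modify this baseline, but the exact
# clipping rule is not specified in the summary above.
import torch
import torch.nn.functional as F

def token_kl_distillation(student_logits, teacher_logits, mask, temperature=2.0):
    """KL(teacher || student) averaged over non-padding tokens.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    mask:                           (batch, seq_len), 1 for real tokens
    """
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # Per-token KL divergence computed from log-probabilities.
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)   # (batch, seq_len)
    kl = (kl * mask).sum() / mask.sum().clamp(min=1)
    return kl * temperature ** 2

# Usage with dummy shapes.
s = torch.randn(2, 16, 32000)
t = torch.randn(2, 16, 32000)
m = torch.ones(2, 16)
print(token_kl_distillation(s, t, m))
```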
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- I2CKD: Intra- and Inter-Class Knowledge Distillation for Semantic Segmentation [1.433758865948252]
This paper proposes a new knowledge distillation method tailored for image semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation (I2CKD).
The focus of this method is on capturing and transferring knowledge between the intermediate layers of the teacher (cumbersome model) and the student (compact model).
arXiv Detail & Related papers (2024-03-27T12:05:22Z)
- Enhancing Out-of-Distribution Detection in Natural Language Understanding via Implicit Layer Ensemble [22.643719584452455]
Out-of-distribution (OOD) detection aims to discern outliers from the intended data distribution.
We propose a novel framework based on contrastive learning that encourages intermediate features to learn layer-specialized representations.
Our approach is significantly more effective than competing methods.
arXiv Detail & Related papers (2022-10-20T06:05:58Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z)
- Self-Distilled Self-Supervised Representation Learning [35.60243157730165]
State-of-the-art frameworks in self-supervised learning have recently shown that fully utilizing transformer-based models can lead to a performance boost.
In our work, we further exploit this by allowing the intermediate representations to learn from the final layers via the contrastive loss.
Our method, Self-Distilled Self-Supervised Learning (SDSSL), outperforms competitive baselines (SimCLR, BYOL and MoCo v3) using ViT on various tasks and datasets.
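As a rough illustration of the self-distillation idea described above, the sketch below pulls an intermediate layer's pooled representation toward the final layer's representation of the same input with an in-batch InfoNCE loss. The stop-gradient on the target and the simple in-batch negatives are assumptions, not the paper's exact recipe.

```python
# Sketch of the SDSSL idea as summarized above: intermediate-layer features
# learn from final-layer features of the same input via a contrastive loss.
# Function names and the in-batch negative scheme are assumptions.
import torch
import torch.nn.functional as F

def self_distill_contrastive(intermediate, final, temperature=0.2):
    """intermediate, final: (batch, dim) pooled features from an intermediate
    layer and the final layer of the same encoder."""
    z_i = F.normalize(intermediate, dim=-1)
    z_f = F.normalize(final.detach(), dim=-1)       # stop gradient to the target
    logits = z_i @ z_f.t() / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(z_i.size(0), device=z_i.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

# Each intermediate layer can add this term on top of the base SSL objective.
loss = self_distill_contrastive(torch.randn(16, 256), torch.randn(16, 256))
```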
arXiv Detail & Related papers (2021-11-25T07:52:36Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers [52.85472936277762]
We apply knowledge distillation to improve the recently proposed late-interaction ColBERT model.
We distill the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product.
We empirically show that our approach improves query latency and greatly reduces the onerous storage requirements of ColBERT.
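To make the distillation target concrete, the sketch below contrasts the two scoring functions involved: ColBERT's MaxSim late interaction (the teacher signal) and the single-vector dot product the student learns to produce. Shapes and the naive mean pooling are purely illustrative assumptions.

```python
# Illustrative contrast between ColBERT-style MaxSim scoring and the
# single-vector dot product targeted by the distillation described above.
import torch

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token, take the max similarity over
    document tokens, then sum over query tokens.
    query_tokens: (q_len, dim), doc_tokens: (d_len, dim)"""
    sim = query_tokens @ doc_tokens.t()        # (q_len, d_len) token similarities
    return sim.max(dim=1).values.sum()

def dot_product_score(query_vec, doc_vec):
    """Single-vector dense retrieval score the student is trained to produce."""
    return query_vec @ doc_vec

# Distillation trains the student's dot product to approximate the teacher's
# MaxSim scores, so each document can be stored as one vector instead of many.
q_tok, d_tok = torch.randn(8, 128), torch.randn(180, 128)
q_vec, d_vec = q_tok.mean(dim=0), d_tok.mean(dim=0)  # naive pooling, for illustration
print(maxsim_score(q_tok, d_tok), dot_product_score(q_vec, d_vec))
```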
arXiv Detail & Related papers (2020-10-22T02:26:01Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)