Similarity Transfer for Knowledge Distillation
- URL: http://arxiv.org/abs/2103.10047v1
- Date: Thu, 18 Mar 2021 06:54:59 GMT
- Title: Similarity Transfer for Knowledge Distillation
- Authors: Haoran Zhao, Kun Gong, Xin Sun, Junyu Dong and Hui Yu
- Abstract summary: Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Experiments show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
- Score: 25.042405967561212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation is a popular paradigm for learning portable neural
networks by transferring the knowledge from a large model into a smaller one.
Most existing approaches enhance the student model by utilizing the
instance-level similarity information between categories provided by the
teacher model. However, these works ignore the similarity correlation between
different instances, which plays an important role in confidence prediction. To tackle this
issue, we propose a novel method in this paper, called similarity transfer for
knowledge distillation (STKD), which aims to fully utilize the similarities
between categories of multiple samples. Furthermore, we propose to better
capture the similarity correlation between different instances via the mixup
technique, which creates virtual samples by weighted linear interpolation.
Note that our distillation loss can fully utilize the similarities among the
incorrect classes via the mixed labels. The proposed approach promotes the
performance of the student model, as the virtual sample created from multiple
images produces similar probability distributions in the teacher and student
networks. Experiments and ablation studies on several public classification
datasets, including CIFAR-10, CIFAR-100, CINIC-10 and Tiny-ImageNet, verify that
this light-weight method can effectively boost the performance of the compact
student model. The results show that STKD substantially outperforms vanilla
knowledge distillation and achieves superior accuracy over state-of-the-art
knowledge distillation methods.
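The mixup-plus-distillation idea described in the abstract can be illustrated with a short PyTorch sketch. This is a minimal illustration rather than the authors' implementation: the interpolation weight lam, the temperature T, and all function names are assumptions, and STKD's full similarity-transfer loss may include additional terms (e.g., on the mixed ground-truth labels) that are not shown here.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x1, x2, lam=0.5):
    """Create virtual samples by weighted linear interpolation of two image batches."""
    return lam * x1 + (1.0 - lam) * x2

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student distributions, so the
    student also absorbs the similarities among the incorrect classes."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def train_step(student, teacher, x1, x2, lam=0.5, T=4.0):
    """One illustrative step: distill on a mixed (virtual) batch."""
    x_mix = mixup_batch(x1, x2, lam)      # virtual samples created from multiple images
    with torch.no_grad():
        t_logits = teacher(x_mix)         # teacher's soft targets on the mixed batch
    s_logits = student(x_mix)
    return distillation_loss(s_logits, t_logits, T)
```

In practice the returned value would be combined with a supervised term on the lam-weighted mixed labels before back-propagating through the student.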
Related papers
- Preview-based Category Contrastive Learning for Knowledge Distillation [53.551002781828146]
We propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD).
It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers.
It can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories.
arXiv Detail & Related papers (2024-10-18T03:31:00Z) - CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective [48.99488315273868]
We present a contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints.
Our method minimizes logit differences within the same sample by considering their numerical values.
We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO.
arXiv Detail & Related papers (2024-04-22T11:52:40Z) - Cosine Similarity Knowledge Distillation for Individual Class
Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher model's performance.
We use cosine similarity, a technique from Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance; a rough sketch of this cosine-weighting idea appears after this list.
arXiv Detail & Related papers (2023-11-24T06:34:47Z) - AICSD: Adaptive Inter-Class Similarity Distillation for Semantic
Segmentation [12.92102548320001]
This paper proposes a novel knowledge distillation method called Inter-Class Similarity Distillation (ICSD).
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z) - CORSD: Class-Oriented Relational Self Distillation [16.11986532440837]
Knowledge distillation is an effective model compression method, but it still has some limitations.
We propose a novel training framework named Class-Oriented Relational Self Distillation (CORSD) to address these limitations.
arXiv Detail & Related papers (2023-04-28T16:00:31Z) - EmbedDistill: A Geometric Knowledge Distillation for Information
Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Contrastive Principal Component Learning: Modeling Similarity by
Augmentation Overlap [50.48888534815361]
We propose a novel Contrastive Principal Component Learning (CPCL) method composed of a contrastive-like loss and an on-the-fly projection loss.
By CPCL, the learned low-dimensional embeddings theoretically preserve the similarity of augmentation distribution between samples.
arXiv Detail & Related papers (2022-06-01T13:03:58Z) - Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
arXiv Detail & Related papers (2020-12-05T00:49:14Z) - Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models.
We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use a Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and the assistant (A) aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
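To make the cosine-similarity idea in the CSWT entry above more concrete, the sketch below weights the distillation temperature per sample by the cosine similarity between teacher and student logits. This is a hypothetical illustration, not the CSWT paper's exact formulation: the temperature range (t_min, t_max), the rescaling of the similarity, and the direction of the mapping are all assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_weighted_temperature(s_logits, t_logits, t_min=2.0, t_max=6.0):
    """Map per-sample teacher-student cosine similarity to a temperature.
    In this sketch, higher agreement yields a lower (sharper) temperature;
    the range and direction are assumptions, not the paper's exact rule."""
    cos = F.cosine_similarity(s_logits, t_logits, dim=1)      # values in [-1, 1]
    sim = (cos + 1.0) / 2.0                                    # rescaled to [0, 1]
    return t_max - sim * (t_max - t_min)                       # shape: (batch,)

def per_sample_kd_loss(s_logits, t_logits, temps):
    """KL distillation loss with a different temperature for each sample."""
    T = temps.unsqueeze(1)                                     # (batch, 1) for broadcasting
    p_t = F.softmax(t_logits / T, dim=1)
    log_p_s = F.log_softmax(s_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)   # per-sample KL
    # temps could be detached beforehand if the temperature itself
    # should not receive gradients from the student.
    return (kl * temps.pow(2)).mean()
```

A typical usage would compute temps = cosine_weighted_temperature(s_logits, t_logits) and pass it to per_sample_kd_loss alongside a standard cross-entropy term on the ground-truth labels.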