Similarity Transfer for Knowledge Distillation
- URL: http://arxiv.org/abs/2103.10047v1
- Date: Thu, 18 Mar 2021 06:54:59 GMT
- Title: Similarity Transfer for Knowledge Distillation
- Authors: Haoran Zhao, Kun Gong, Xin Sun, Junyu Dong and Hui Yu
- Abstract summary: Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Experiments show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
- Score: 25.042405967561212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation is a popular paradigm for learning portable neural
networks by transferring the knowledge from a large model into a smaller one.
Most existing approaches enhance the student model by utilizing the
instance-level similarity information between categories provided by the
teacher model. However, these works ignore the similarity correlation between
different instances, which plays an important role in confidence prediction. To tackle this
issue, we propose a novel method in this paper, called similarity transfer for
knowledge distillation (STKD), which aims to fully utilize the similarities
between categories of multiple samples. Furthermore, we propose to better
capture the similarity correlation between different instances via the mixup
technique, which creates virtual samples by weighted linear interpolation.
Note that our distillation loss can fully utilize the similarities among the
incorrect classes via the mixed labels. The proposed approach promotes the
performance of the student model, as the virtual sample created from multiple
images produces similar probability distributions in the teacher and student
networks. Experiments and ablation studies on several public classification
datasets, including CIFAR-10, CIFAR-100, CINIC-10 and Tiny-ImageNet, verify that
this light-weight method can effectively boost the performance of the compact
student model. The results show that STKD substantially outperforms vanilla
knowledge distillation and achieves superior accuracy over state-of-the-art
knowledge distillation methods.
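The mixup-plus-distillation idea described in the abstract can be illustrated with a short PyTorch sketch. This is a minimal illustration rather than the authors' implementation: the interpolation weight lam, the temperature T, and all function names are assumptions, and STKD's full similarity-transfer loss may include additional terms (e.g., on the mixed ground-truth labels) that are not shown here.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x1, x2, lam=0.5):
    """Create virtual samples by weighted linear interpolation of two image batches."""
    return lam * x1 + (1.0 - lam) * x2

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student distributions, so the
    student also absorbs the similarities among the incorrect classes."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def train_step(student, teacher, x1, x2, lam=0.5, T=4.0):
    """One illustrative step: distill on a mixed (virtual) batch."""
    x_mix = mixup_batch(x1, x2, lam)      # virtual samples created from multiple images
    with torch.no_grad():
        t_logits = teacher(x_mix)         # teacher's soft targets on the mixed batch
    s_logits = student(x_mix)
    return distillation_loss(s_logits, t_logits, T)
```

In practice the returned value would be combined with a supervised term on the lam-weighted mixed labels before back-propagating through the student.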
Related papers
- Preview-based Category Contrastive Learning for Knowledge Distillation [53.551002781828146]
We propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD).
It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers.
It can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories.
arXiv Detail & Related papers (2024-10-18T03:31:00Z) - CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective [48.99488315273868]
We present a contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints.
Our method minimizes logit differences within the same sample by considering their numerical values.
We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO.
arXiv Detail & Related papers (2024-04-22T11:52:40Z) - Cosine Similarity Knowledge Distillation for Individual Class
Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher model's performance.
We use cosine similarity, a technique from Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance; a rough sketch of this cosine-weighting idea appears after this list.
arXiv Detail & Related papers (2023-11-24T06:34:47Z) - AICSD: Adaptive Inter-Class Similarity Distillation for Semantic
Segmentation [12.92102548320001]
This paper proposes a novel knowledge distillation method called Inter-Class Similarity Distillation (ICSD).
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z) - CORSD: Class-Oriented Relational Self Distillation [16.11986532440837]
Knowledge distillation is an effective model compression method, but it still has some limitations.
We propose a novel training framework named Class-Oriented Relational Self Distillation (CORSD) to address these limitations.
arXiv Detail & Related papers (2023-04-28T16:00:31Z) - EmbedDistill: A Geometric Knowledge Distillation for Information
Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Contrastive Principal Component Learning: Modeling Similarity by
Augmentation Overlap [50.48888534815361]
We propose a novel Contrastive Principal Component Learning (CPCL) method composed of a contrastive-like loss and an on-the-fly projection loss.
By CPCL, the learned low-dimensional embeddings theoretically preserve the similarity of augmentation distribution between samples.
arXiv Detail & Related papers (2022-06-01T13:03:58Z) - Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
arXiv Detail & Related papers (2020-12-05T00:49:14Z) - Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models.
We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use a Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and the assistant (A) aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
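To make the cosine-similarity idea in the CSWT entry above more concrete, the sketch below weights the distillation temperature per sample by the cosine similarity between teacher and student logits. This is a hypothetical illustration, not the CSWT paper's exact formulation: the temperature range (t_min, t_max), the rescaling of the similarity, and the direction of the mapping are all assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_weighted_temperature(s_logits, t_logits, t_min=2.0, t_max=6.0):
    """Map per-sample teacher-student cosine similarity to a temperature.
    In this sketch, higher agreement yields a lower (sharper) temperature;
    the range and direction are assumptions, not the paper's exact rule."""
    cos = F.cosine_similarity(s_logits, t_logits, dim=1)      # values in [-1, 1]
    sim = (cos + 1.0) / 2.0                                    # rescaled to [0, 1]
    return t_max - sim * (t_max - t_min)                       # shape: (batch,)

def per_sample_kd_loss(s_logits, t_logits, temps):
    """KL distillation loss with a different temperature for each sample."""
    T = temps.unsqueeze(1)                                     # (batch, 1) for broadcasting
    p_t = F.softmax(t_logits / T, dim=1)
    log_p_s = F.log_softmax(s_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)   # per-sample KL
    # temps could be detached beforehand if the temperature itself
    # should not receive gradients from the student.
    return (kl * temps.pow(2)).mean()
```

A typical usage would compute temps = cosine_weighted_temperature(s_logits, t_logits) and pass it to per_sample_kd_loss alongside a standard cross-entropy term on the ground-truth labels.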