Understanding the Effects of Projectors in Knowledge Distillation
- URL: http://arxiv.org/abs/2310.17183v1
- Date: Thu, 26 Oct 2023 06:30:39 GMT
- Title: Understanding the Effects of Projectors in Knowledge Distillation
- Authors: Yudong Chen, Sen Wang, Jiajun Liu, Xuwei Xu, Frank de Hoog, Brano
Kusy, Zi Huang
- Abstract summary: Even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance.
This paper investigates the implicit role that projectors play, which has so far been overlooked.
Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance.
- Score: 31.882356225974632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventionally, during the knowledge distillation process (e.g. feature
distillation), an additional projector is often required to perform feature
transformation due to the dimension mismatch between the teacher and the
student networks. Interestingly, we discovered that even if the student and the
teacher have the same feature dimensions, adding a projector still helps to
improve the distillation performance. In addition, projectors even improve
logit distillation if we add them to the architecture. Inspired by these
surprising findings and the general lack of understanding of projectors in
the knowledge distillation process in the existing literature, this paper
investigates the implicit role that projectors play but that has so far been
overlooked. Our empirical study shows that the student with a projector (1)
obtains a better trade-off between the training accuracy and the testing
accuracy compared to the student without a projector when it has the same
feature dimensions as the teacher, (2) better preserves its similarity to the
teacher beyond shallow and numeric resemblance, from the view of Centered
Kernel Alignment (CKA), and (3) avoids becoming over-confident at the testing
phase in the way the teacher does. Motivated by the positive effects of projectors, we
propose a projector ensemble-based feature distillation method to further
improve distillation performance. Despite the simplicity of the proposed
strategy, empirical results from the evaluation of classification tasks on
benchmark datasets demonstrate the superior classification performance of our
method on a broad range of teacher-student pairs and verify from the aspects of
CKA and model calibration that the student's features are of improved quality
with the projector ensemble design.
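To make the mechanisms described in the abstract concrete, below is a minimal PyTorch-style sketch (not the authors' released code) of feature distillation through an ensemble of projectors, together with a linear-CKA function of the kind used to compare student and teacher features. The two-layer MLP projector, the ensemble size, the L2 feature-matching loss, and the feature shapes are illustrative assumptions and may differ from the paper's exact configuration.

```python
# Illustrative sketch of projector-ensemble feature distillation plus a
# linear-CKA similarity check. All hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_projector(s_dim: int, t_dim: int) -> nn.Module:
    """One projector: maps student features into the teacher's feature space."""
    return nn.Sequential(nn.Linear(s_dim, t_dim), nn.ReLU(), nn.Linear(t_dim, t_dim))


class ProjectorEnsembleFD(nn.Module):
    """Feature distillation with an ensemble of projectors.

    Each projector transforms the student's penultimate features; their
    outputs are averaged before being matched to the teacher's features,
    so the student is regularised by several feature transformations at once.
    """

    def __init__(self, s_dim: int, t_dim: int, num_projectors: int = 3):
        super().__init__()
        self.projectors = nn.ModuleList(
            [make_projector(s_dim, t_dim) for _ in range(num_projectors)]
        )

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Average the ensemble's projections of the student features.
        projected = torch.stack([p(f_student) for p in self.projectors]).mean(dim=0)
        # Match the (detached) teacher features; an L2 loss is used here for illustration.
        return F.mse_loss(projected, f_teacher.detach())


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two feature matrices of shape (batch, dim)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic = ((y.T @ x) ** 2).sum()      # ||Y^T X||_F^2
    norm_x = (x.T @ x).norm()          # ||X^T X||_F
    norm_y = (y.T @ y).norm()          # ||Y^T Y||_F
    return hsic / (norm_x * norm_y)


if __name__ == "__main__":
    s_feat = torch.randn(32, 256)      # hypothetical student features
    t_feat = torch.randn(32, 512)      # hypothetical teacher features
    fd = ProjectorEnsembleFD(s_dim=256, t_dim=512)
    loss = fd(s_feat, t_feat)          # would be added to the task loss with a weight
    sim = linear_cka(fd.projectors[0](s_feat), t_feat)
    print(loss.item(), sim.item())
```

In practice, the feature-distillation loss would be combined with the student's task loss via a weighting coefficient, and CKA would be computed on held-out batches to compare students trained with and without the projector ensemble.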
Related papers
- Learning Lightweight Object Detectors via Multi-Teacher Progressive
Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z) - Understanding the Role of the Projector in Knowledge Distillation [22.698845243751293]
We revisit the efficacy of knowledge distillation as a function matching and metric learning problem.
We verify three important design decisions, namely the normalisation, soft maximum function, and projection layers.
We attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet.
arXiv Detail & Related papers (2023-03-20T13:33:31Z) - Improved Feature Distillation via Projector Ensemble [40.86679028635297]
We propose a new feature distillation method based on a projector ensemble for further performance improvement.
We observe that the student network benefits from a projector even if the feature dimensions of the student and the teacher are the same.
We propose an ensemble of projectors to further improve the quality of student features.
arXiv Detail & Related papers (2022-10-27T09:08:40Z) - Cross-Architecture Knowledge Distillation [32.689574589575244]
It is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN).
To deal with this problem, a novel cross-architecture knowledge distillation method is proposed.
The proposed method outperforms 14 state-of-the-arts on both small-scale and large-scale datasets.
arXiv Detail & Related papers (2022-07-12T02:50:48Z) - Knowledge Distillation with the Reused Teacher Classifier [31.22117343316628]
We show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap.
Our technique achieves state-of-the-art results at a modest cost in compression ratio due to the added projector.
arXiv Detail & Related papers (2022-03-26T06:28:46Z) - Delta Distillation for Efficient Video Processing [68.81730245303591]
We propose a novel knowledge distillation schema coined as Delta Distillation.
We demonstrate that these temporal variations can be effectively distilled due to the temporal redundancies within video frames.
As a by-product, delta distillation improves the temporal consistency of the teacher model.
arXiv Detail & Related papers (2022-03-17T20:13:30Z) - Distilling Image Classifiers in Object Detectors [81.63849985128527]
We study the case of object detection and, instead of following the standard detector-to-detector distillation approach, introduce a classifier-to-detector knowledge transfer framework.
In particular, we propose strategies to exploit the classification teacher to improve both the detector's recognition accuracy and localization performance.
arXiv Detail & Related papers (2021-06-09T16:50:10Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.