Improved Feature Distillation via Projector Ensemble
- URL: http://arxiv.org/abs/2210.15274v1
- Date: Thu, 27 Oct 2022 09:08:40 GMT
- Title: Improved Feature Distillation via Projector Ensemble
- Authors: Yudong Chen, Sen Wang, Jiajun Liu, Xuwei Xu, Frank de Hoog, Zi Huang
- Abstract summary: We propose a new feature distillation method based on a projector ensemble for further performance improvement.
We observe that the student network benefits from a projector even if the feature dimensions of the student and the teacher are the same.
We propose an ensemble of projectors to further improve the quality of student features.
- Score: 40.86679028635297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In knowledge distillation, previous feature distillation methods mainly focus
on the design of loss functions and the selection of the distilled layers,
while the effect of the feature projector between the student and the teacher
remains under-explored. In this paper, we first discuss a plausible mechanism
of the projector with empirical evidence and then propose a new feature
distillation method based on a projector ensemble for further performance
improvement. We observe that the student network benefits from a projector even
if the feature dimensions of the student and the teacher are the same. Training
a student backbone without a projector can be considered a multi-task learning
process, namely simultaneously achieving discriminative feature extraction for
classification and feature matching between the student and the teacher for
distillation. We hypothesize and empirically verify that without a projector,
the student network tends to overfit the teacher's feature distributions
despite having a different architecture and weight initialization. This
degrades the quality of the student's deep features that are eventually used
in classification. Adding a projector, on the other hand,
disentangles the two learning tasks and helps the student network to focus
better on the main feature extraction task while still being able to utilize
teacher features as guidance through the projector. Motivated by the positive
effect of the projector in feature distillation, we propose an ensemble of
projectors to further improve the quality of student features. Experimental
results on different datasets with a series of teacher-student pairs illustrate
the effectiveness of the proposed method.
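The abstract describes the method only in prose. As a rough illustration, the following is a minimal sketch of a projector-ensemble feature distillation loss, assuming PyTorch; the MLP projector architecture, ensemble size, feature normalization, and the `ProjectorEnsemble` / `feature_distillation_loss` names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of projector-ensemble feature distillation (assumes PyTorch).
# Projector architecture, ensemble size, and loss details are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectorEnsemble(nn.Module):
    """Ensemble of small MLP projectors mapping student features to the
    teacher's feature dimension (hypothetical helper)."""

    def __init__(self, student_dim: int, teacher_dim: int, num_projectors: int = 3):
        super().__init__()
        self.projectors = nn.ModuleList(
            nn.Sequential(
                nn.Linear(student_dim, teacher_dim),
                nn.ReLU(inplace=True),
                nn.Linear(teacher_dim, teacher_dim),
            )
            for _ in range(num_projectors)
        )

    def forward(self, student_feat: torch.Tensor):
        # One projected student feature per ensemble member.
        return [proj(student_feat) for proj in self.projectors]


def feature_distillation_loss(student_feat, teacher_feat, ensemble):
    """Average feature-matching loss over the projector ensemble. Teacher
    features are detached so gradients flow only into the student backbone
    and the projectors, which are only needed during training."""
    teacher_feat = teacher_feat.detach()
    losses = [
        F.mse_loss(F.normalize(p, dim=1), F.normalize(teacher_feat, dim=1))
        for p in ensemble(student_feat)
    ]
    return torch.stack(losses).mean()


# Usage sketch: the distillation term is added to the ordinary classification
# loss, e.g. total = ce_loss + lambda_kd * feature_distillation_loss(fs, ft, ensemble).
```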
Related papers
- Understanding the Effects of Projectors in Knowledge Distillation [31.882356225974632]
Even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance.
This paper investigates the implicit role that projectors play but that has so far been overlooked.
Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance.
arXiv Detail & Related papers (2023-10-26T06:30:39Z)
- Understanding the Role of the Projector in Knowledge Distillation [22.698845243751293]
We revisit the efficacy of knowledge distillation as a function matching and metric learning problem.
We verify three important design decisions, namely the normalisation, soft maximum function, and projection layers.
We attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet.
arXiv Detail & Related papers (2023-03-20T13:33:31Z)
- Knowledge Distillation with the Reused Teacher Classifier [31.22117343316628]
We show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap.
Our technique achieves state-of-the-art results at the modest cost of a slightly reduced compression ratio due to the added projector.
arXiv Detail & Related papers (2022-03-26T06:28:46Z)
- Delta Distillation for Efficient Video Processing [68.81730245303591]
We propose a novel knowledge distillation scheme, coined Delta Distillation.
We demonstrate that the temporal variations between consecutive frames can be effectively distilled thanks to the temporal redundancies across video frames.
As a by-product, delta distillation improves the temporal consistency of the teacher model.
arXiv Detail & Related papers (2022-03-17T20:13:30Z)
- Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the connection paths across levels between the teacher and student networks, and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our final nested and compact framework requires negligible overhead and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation in a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student; a minimal sketch of the generic soft-target loss behind such transfer follows this list.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
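The "dark knowledge" mentioned in the last entry above refers to the standard soft-target transfer of Hinton-style knowledge distillation. As background, here is a minimal sketch of that generic loss, assuming PyTorch; it is not the self-supervision method of that paper, and the function name is illustrative.

```python
# Generic soft-target ("dark knowledge") distillation loss, Hinton-style.
# Background for the entry above, not that paper's self-supervision method.
import torch.nn.functional as F


def soft_target_kd_loss(student_logits, teacher_logits, temperature: float = 4.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes comparable."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return (
        F.kl_div(log_p_student, p_teacher, reduction="batchmean")
        * temperature ** 2
    )
```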