Improving Knowledge Distillation via Regularizing Feature Norm and
Direction
- URL: http://arxiv.org/abs/2305.17007v1
- Date: Fri, 26 May 2023 15:05:19 GMT
- Title: Improving Knowledge Distillation via Regularizing Feature Norm and
Direction
- Authors: Yuzhu Wang, Lechao Cheng, Manni Duan, Yongheng Wang, Zunlei Feng, Shu
Kong
- Abstract summary: Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance, e.g., classification accuracy.
- Score: 16.98806338782858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) exploits a large well-trained model (i.e.,
teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge
distillation train the student by aligning its features with the teacher's, e.g.,
by minimizing the KL-divergence between their logits or the L2 distance between
their intermediate features. While it is natural to believe that better
alignment of student features to the teacher better distills teacher knowledge,
simply forcing this alignment does not directly contribute to the student's
performance, e.g., classification accuracy. In this work, we propose to align
student features with class-mean of teacher features, where class-mean
naturally serves as a strong classifier. To this end, we explore baseline
techniques such as adopting the cosine distance based loss to encourage the
similarity between student features and their corresponding class-means of the
teacher. Moreover, we train the student to produce large-norm features,
inspired by other lines of work (e.g., model pruning and domain adaptation),
which find the large-norm features to be more significant. Finally, we propose
a rather simple loss term (dubbed ND loss) to simultaneously (1) encourage the
student to produce large-\emph{norm} features, and (2) align the
\emph{direction} of student features and teacher class-means. Experiments on
standard benchmarks demonstrate that our explored techniques help existing KD
methods achieve better performance, i.e., higher classification accuracy on
ImageNet and CIFAR100 datasets, and higher detection precision on COCO dataset.
Importantly, our proposed ND loss helps the most, leading to the
state-of-the-art performance on these benchmarks. The source code is available
at \url{https://github.com/WangYZ1608/Knowledge-Distillation-via-ND}.
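The code below is a rough, self-contained PyTorch sketch of what the abstract describes: class-means are estimated from teacher features, a cosine (direction) term pulls student features toward the class-mean of their label, and a norm term rewards large-norm student features. It is not the released implementation (see the repository above); the weighting and the exact form of the norm term are assumptions.

```python
# Minimal PyTorch sketch of a norm-and-direction (ND-style) regularizer as
# described in the abstract. This is NOT the authors' official ND loss; the
# exact norm term and the weight `lambda_norm` are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_class_means(teacher_feats, labels, num_classes):
    """Average the (pre-computed) teacher features per class.

    teacher_feats: (N, D) features from the frozen teacher.
    labels:        (N,) integer class labels.
    """
    d = teacher_feats.size(1)
    sums = torch.zeros(num_classes, d, device=teacher_feats.device)
    counts = torch.zeros(num_classes, 1, device=teacher_feats.device)
    sums.index_add_(0, labels, teacher_feats)
    counts.index_add_(0, labels, torch.ones(labels.size(0), 1, device=teacher_feats.device))
    return sums / counts.clamp_min(1.0)

def nd_style_loss(student_feats, labels, class_means, lambda_norm=0.1):
    """Direction term: cosine distance between each student feature and the
    teacher class-mean of its label. Norm term: reward large-norm student
    features (the paper's exact formulation may differ)."""
    target = class_means[labels]                                         # (B, D)
    direction = 1.0 - F.cosine_similarity(student_feats, target, dim=1)  # (B,)
    norm = -student_feats.norm(dim=1)                                    # reward large norms
    return direction.mean() + lambda_norm * norm.mean()

# Usage: add this term to an existing KD objective, e.g.
#   loss = ce_loss + kd_loss + nd_style_loss(f_s, y, class_means)
```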
Related papers
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
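DiffKD itself builds on diffusion models; as a heavily reduced stand-in that only conveys the denoise-then-match idea, one could train a one-step denoiser on noised teacher features and apply it to student features before matching, as in the hypothetical sketch below.

```python
# Heavily simplified stand-in for the denoise-then-match idea: a one-step
# denoiser trained on Gaussian-noised teacher features, applied to student
# features before an L2 match. DiffKD itself uses a diffusion model; this
# sketch does not reproduce it.
import torch
import torch.nn as nn
import torch.nn.functional as F

denoiser = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256))

def denoiser_train_loss(f_t, sigma=0.5):
    # Learn to recover clean teacher features from noised ones.
    noisy = f_t + sigma * torch.randn_like(f_t)
    return F.mse_loss(denoiser(noisy), f_t)

def denoise_and_match_loss(f_s, f_t):
    # Treat the student feature as a "noisy" teacher feature, denoise, match.
    return F.mse_loss(denoiser(f_s), f_t.detach())
```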
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
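The summary above does not spell out the mechanism; one plausible (assumed) reading is to perturb inputs along gradients obtained from the teacher's backward pass and distill on the resulting auxiliary samples, roughly as sketched below.

```python
# Hedged sketch of "extracting knowledge from the teacher's backward pass":
# perturb inputs along the gradient of the teacher's loss to synthesize
# auxiliary samples, then distill on them. The mechanism and epsilon are
# assumptions, not necessarily the paper's procedure.
import torch
import torch.nn.functional as F

def backward_pass_aux_samples(teacher, x, y, eps=0.05):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(teacher(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    # Move inputs a small step along the teacher's input gradient.
    return (x_adv + eps * grad.sign()).detach()

def kd_on_aux(teacher, student, x, y, T=4.0):
    x_aux = backward_pass_aux_samples(teacher, x, y)
    with torch.no_grad():
        p_t = F.softmax(teacher(x_aux) / T, dim=1)
    log_p_s = F.log_softmax(student(x_aux) / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```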
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
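A toy way to picture "relative geometry" distillation is to match the student's in-batch query-document similarity matrix to the teacher's; the KL-over-softmax loss below is an illustrative assumption, not the paper's objective.

```python
# Hedged sketch: distill the relative geometry among queries and documents by
# matching the student's query-document similarity matrix to the teacher's.
import torch
import torch.nn.functional as F

def geometry_distill_loss(q_s, d_s, q_t, d_t, tau=0.05):
    """q_*: (B, D) query embeddings, d_*: (B, D) document embeddings."""
    # In-batch similarity matrices (rows: queries, columns: documents).
    s_student = q_s @ d_s.t() / tau
    s_teacher = q_t @ d_t.t() / tau
    return F.kl_div(F.log_softmax(s_student, dim=1),
                    F.softmax(s_teacher, dim=1),
                    reduction="batchmean")
```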
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by using larger teachers.
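One speculative reading of "integrating part of the teacher's features as prior knowledge" is to splice a random portion of teacher features into the student's feature map before applying the feature loss; the sketch below makes that assumption explicit and is not DPK's actual design.

```python
# Hedged sketch: randomly replace a fraction of the student's feature
# locations with the teacher's, then apply an L2 feature loss. The masking
# scheme and ratio are assumptions, not DPK's exact mechanism.
import torch
import torch.nn.functional as F

def prior_mixed_feature_loss(f_s, f_t, prior_ratio=0.5):
    """f_s, f_t: (B, C, H, W) student / teacher feature maps (same shape)."""
    b, _, h, w = f_s.shape
    keep_teacher = (torch.rand(b, 1, h, w, device=f_s.device) < prior_ratio).float()
    mixed = keep_teacher * f_t.detach() + (1.0 - keep_teacher) * f_s
    return F.mse_loss(mixed, f_t.detach())
```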
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Prediction-Guided Distillation for Dense Object Detection [7.5320132424481505]
We show that only a very small fraction of features within a ground-truth bounding box are responsible for a teacher's high detection performance.
We propose Prediction-Guided Distillation (PGD), which focuses distillation on these key predictive regions of the teacher.
Our proposed approach outperforms current state-of-the-art KD baselines on a variety of advanced one-stage detection architectures.
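A simple way to emulate "focusing distillation on key predictive regions" is to weight a per-location feature loss by an importance map derived from the teacher's dense predictions; using the teacher's max class probability as that map, as below, is an assumption.

```python
# Hedged sketch: weight per-location feature distillation by an importance map
# derived from the teacher's dense classification scores.
import torch
import torch.nn.functional as F

def prediction_guided_feat_loss(f_s, f_t, teacher_cls_logits):
    """f_s, f_t: (B, C, H, W); teacher_cls_logits: (B, K, H, W) dense scores."""
    with torch.no_grad():
        weight = teacher_cls_logits.softmax(dim=1).amax(dim=1, keepdim=True)   # (B,1,H,W)
        weight = weight / weight.sum(dim=(2, 3), keepdim=True).clamp_min(1e-6)
    per_loc = (f_s - f_t.detach()).pow(2).mean(dim=1, keepdim=True)            # (B,1,H,W)
    return (weight * per_loc).sum(dim=(2, 3)).mean()
```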
arXiv Detail & Related papers (2022-03-10T16:46:05Z)
- Knowledge Distillation for Object Detection via Rank Mimicking and Prediction-guided Feature Imitation [34.441349114336994]
We propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors.
RM takes the rank of candidate boxes from teachers as a new form of knowledge to distill.
PFI attempts to correlate feature differences with prediction differences, making feature imitation directly help to improve the student's accuracy.
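The rank-mimicking part can be pictured as matching softmax distributions over candidate-box scores; the temperature and KL formulation in the sketch below are assumptions.

```python
# Hedged sketch of Rank Mimicking: align the student's ranking of candidate
# boxes with the teacher's by matching softmax distributions over per-box scores.
import torch
import torch.nn.functional as F

def rank_mimicking_loss(student_box_scores, teacher_box_scores, tau=1.0):
    """*_box_scores: (B, N) confidence of N candidate boxes per image."""
    p_t = F.softmax(teacher_box_scores.detach() / tau, dim=1)
    log_p_s = F.log_softmax(student_box_scores / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")
```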
arXiv Detail & Related papers (2021-12-09T11:19:15Z)
- SLADE: A Self-Training Framework For Distance Metric Learning [75.54078592084217]
We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data.
We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data.
We then train a student model on both labels and pseudo labels to generate final feature embeddings.
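The teacher-pseudo-labels-then-student-trains loop is easy to sketch; the plain cross-entropy losses below are placeholders, since SLADE actually targets distance metric learning.

```python
# Hedged sketch of the self-training loop described above: pseudo-label the
# unlabeled data with the teacher, then train the student on both.
import torch
import torch.nn.functional as F

def self_training_step(teacher, student, x_labeled, y_labeled, x_unlabeled):
    with torch.no_grad():
        pseudo = teacher(x_unlabeled).argmax(dim=1)        # teacher pseudo labels
    loss_sup = F.cross_entropy(student(x_labeled), y_labeled)
    loss_pseudo = F.cross_entropy(student(x_unlabeled), pseudo)
    return loss_sup + loss_pseudo
```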
arXiv Detail & Related papers (2020-11-20T08:26:10Z)
- Distilling Knowledge by Mimicking Features [32.79431807764681]
We argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer.
Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer.
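Penultimate-layer feature mimicking reduces to an L2 term between student and teacher embeddings; the linear projector in the sketch below is an assumed fix for any dimension mismatch.

```python
# Minimal sketch of penultimate-layer feature mimicking: an L2 loss between
# student and teacher embeddings, with a linear projector (an assumption) to
# bridge a dimension mismatch. No teacher softmax layer is needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

projector = nn.Linear(128, 512)   # hypothetical student-dim -> teacher-dim

def feature_mimicking_loss(f_s, f_t):
    """f_s: (B, 128) student penultimate features; f_t: (B, 512) teacher's."""
    return F.mse_loss(projector(f_s), f_t.detach())
```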
arXiv Detail & Related papers (2020-11-03T02:15:14Z)
- ProxylessKD: Direct Knowledge Distillation with Inherited Classifier for Face Recognition [84.49978494275382]
Knowledge Distillation (KD) refers to transferring knowledge from a large model to a smaller one.
In this work, we focus on its application in face recognition.
We propose a novel method named ProxylessKD that directly optimizes face recognition accuracy.
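One way to read the inherited-classifier idea: freeze the teacher's classifier, attach it to the student backbone, and optimize the recognition loss directly; the sketch below follows that assumed reading rather than ProxylessKD's exact recipe.

```python
# Hedged sketch of an "inherited classifier": copy and freeze the teacher's
# final classifier, attach it to the student's backbone, and optimize the
# ordinary recognition loss directly.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_inherited_head(teacher_classifier: nn.Linear) -> nn.Linear:
    head = nn.Linear(teacher_classifier.in_features, teacher_classifier.out_features)
    head.load_state_dict(teacher_classifier.state_dict())
    for p in head.parameters():
        p.requires_grad_(False)       # classifier is inherited, not re-trained
    return head

def proxyless_style_loss(student_backbone, inherited_head, x, y):
    logits = inherited_head(student_backbone(x))   # student feature -> teacher classifier
    return F.cross_entropy(logits, y)
```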
arXiv Detail & Related papers (2020-10-31T13:14:34Z)