Grouped Knowledge Distillation for Deep Face Recognition
- URL: http://arxiv.org/abs/2304.04462v1
- Date: Mon, 10 Apr 2023 09:04:38 GMT
- Title: Grouped Knowledge Distillation for Deep Face Recognition
- Authors: Weisong Zhao, Xiangyu Zhu, Kaiwen Guo, Xiao-Yu Zhang, Zhen Lei
- Abstract summary: The light-weight student network has difficulty fitting the target logits due to its low model capacity.
We propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation.
- Score: 53.57402723008569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compared with feature-based distillation methods, logits distillation
relaxes the requirement of consistent feature dimensions between teacher and
student networks, but its performance is deemed inferior in face recognition.
One major challenge is that the light-weight student network has difficulty
fitting the target logits due to its low model capacity, a difficulty attributed
to the large number of identities in face recognition. Therefore, we seek to
probe the target logits to extract the primary knowledge related to face
identity, and discard the others, to make the distillation more achievable for
the student network. Specifically, there is a tail group with near-zero values
in the prediction, containing minor knowledge for distillation. To provide a
clear perspective of its impact, we first partition the logits into two groups,
i.e., Primary Group and Secondary Group, according to the cumulative
probability of the softened prediction. Then, we reorganize the Knowledge
Distillation (KD) loss of grouped logits into three parts, i.e., Primary-KD,
Secondary-KD, and Binary-KD. Primary-KD refers to distilling the primary
knowledge from the teacher, Secondary-KD aims to refine minor knowledge but
increases the difficulty of distillation, and Binary-KD ensures the consistency
of knowledge distribution between teacher and student. We experimentally found
that (1) Primary-KD and Binary-KD are indispensable for KD, and (2)
Secondary-KD is the culprit restricting KD at the bottleneck. Therefore, we
propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and
Binary-KD but omits Secondary-KD in the ultimate KD loss calculation. Extensive
experimental results on popular face recognition benchmarks demonstrate the
superiority of the proposed GKD over state-of-the-art methods.
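To make the decomposition above concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a grouped KD loss that keeps Primary-KD and Binary-KD and omits Secondary-KD. The cumulative-probability threshold, the temperature, and the equal weighting of the two retained terms are illustrative assumptions based only on the abstract.
```python
import torch
import torch.nn.functional as F

def grouped_kd_loss(student_logits, teacher_logits, temperature=4.0, cum_threshold=0.9):
    """Primary-KD + Binary-KD over grouped logits; Secondary-KD is omitted.

    The primary group is the smallest set of classes whose softened teacher
    probabilities reach `cum_threshold`; the remaining classes form the
    secondary (tail) group. Hyperparameter values are illustrative only.
    """
    eps = 1e-12
    t_prob = F.softmax(teacher_logits / temperature, dim=1)
    s_prob = F.softmax(student_logits / temperature, dim=1)

    # Identify the primary group per sample from the teacher's softened prediction.
    sorted_prob, sorted_idx = t_prob.sort(dim=1, descending=True)
    cum_before = sorted_prob.cumsum(dim=1) - sorted_prob   # mass accumulated before each rank
    keep_sorted = cum_before < cum_threshold                # ranks needed to reach the threshold
    primary_mask = torch.zeros_like(t_prob, dtype=torch.bool)
    primary_mask.scatter_(1, sorted_idx, keep_sorted)

    def kl(p, q):
        # KL(p || q) per sample, averaged over the batch; zero entries contribute zero.
        return (p * (p.clamp_min(eps).log() - q.clamp_min(eps).log())).sum(dim=1).mean()

    # Binary-KD: match the total probability mass assigned to the two groups.
    t_bin = torch.stack([(t_prob * primary_mask).sum(1), (t_prob * ~primary_mask).sum(1)], dim=1)
    s_bin = torch.stack([(s_prob * primary_mask).sum(1), (s_prob * ~primary_mask).sum(1)], dim=1)
    binary_kd = kl(t_bin, s_bin)

    # Primary-KD: match the within-group distribution over the primary classes.
    t_primary = (t_prob * primary_mask) / t_bin[:, :1].clamp_min(eps)
    s_primary = (s_prob * primary_mask) / s_bin[:, :1].clamp_min(eps)
    primary_kd = kl(t_primary, s_primary)

    # Secondary-KD (the within-group loss over the tail classes) is deliberately dropped.
    return (primary_kd + binary_kd) * temperature ** 2
```
In a full training setup this term would typically be added to the student's own classification loss.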
Related papers
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z) - AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition [8.045296450065019]
Knowledge distillation aims at improving the performance of a compact student model by distilling the knowledge from a high-performing teacher model.
AdaDistill embeds the KD concept into the softmax loss by training the student using a margin penalty softmax loss with distilled class centers from the teacher.
Extensive experiments and ablation studies show that AdaDistill can enhance the discriminative learning capability of the student.
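As a rough illustration of the margin-penalty softmax over teacher class centers summarized above, here is a minimal PyTorch sketch. The ArcFace-style margin, the scale value, and the function name are illustrative assumptions, and AdaDistill's adaptive margin scheduling is not modeled.
```python
import torch
import torch.nn.functional as F

def margin_softmax_with_teacher_centers(student_emb, teacher_centers, labels,
                                        scale=64.0, margin=0.5):
    """Margin-penalty softmax over class centers taken from the teacher (illustrative values)."""
    emb = F.normalize(student_emb, dim=1)             # [B, D] student embeddings
    centers = F.normalize(teacher_centers, dim=1)     # [C, D] distilled class centers
    cos = emb @ centers.t()                           # cosine logits, [B, C]
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    # Add an additive angular margin to the ground-truth class only.
    target_cos = torch.cos(theta.gather(1, labels[:, None]) + margin)
    logits = cos.scatter(1, labels[:, None], target_cos)
    return F.cross_entropy(scale * logits, labels)
```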
arXiv Detail & Related papers (2024-07-01T14:39:55Z) - CrossKD: Cross-Head Knowledge Distillation for Object Detection [69.16346256926842]
Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors.
We present a prediction mimicking distillation scheme, called CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head.
Our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods.
arXiv Detail & Related papers (2023-06-20T08:19:51Z) - Gradient-Guided Knowledge Distillation for Object Detectors [3.236217153362305]
We propose a novel approach for knowledge distillation in object detection, named Gradient-guided Knowledge Distillation (GKD).
Our GKD uses gradient information to identify and assign more weights to features that significantly impact the detection loss, allowing the student to learn the most relevant features from the teacher.
Experiments on the KITTI and COCO-Traffic datasets demonstrate our method's efficacy in knowledge distillation for object detection.
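As a rough illustration of weighting feature distillation by gradient information, the snippet below scales a feature-mimicking loss by the magnitude of the gradient of the detection loss with respect to the student features. The specific weighting and normalization are assumptions for illustration, not the paper's formulation.
```python
import torch

def gradient_weighted_feature_kd(student_feat, teacher_feat, detection_loss):
    """Weight the feature-mimicking loss by per-location gradient magnitude (illustrative)."""
    # Gradient of the detection loss w.r.t. the student feature map ([B, C, H, W]).
    grads = torch.autograd.grad(detection_loss, student_feat, retain_graph=True)[0]
    weight = grads.abs().mean(dim=1, keepdim=True)                      # [B, 1, H, W]
    weight = weight / weight.sum(dim=(2, 3), keepdim=True).clamp_min(1e-12)
    # Locations with larger detection-loss gradients contribute more to the mimicking loss.
    return (weight * (student_feat - teacher_feat.detach()).pow(2)).sum(dim=(1, 2, 3)).mean()
```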
arXiv Detail & Related papers (2023-03-07T21:09:09Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue -- popular items are more heavily recommended after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Decoupled Knowledge Distillation [7.049113958508325]
We reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD).
TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works.
We present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly.
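For concreteness, here is a minimal PyTorch sketch of the TCKD/NCKD split described above; the loss weights and temperature are illustrative values, not the paper's settings.
```python
import torch
import torch.nn.functional as F

def decoupled_kd_loss(student_logits, teacher_logits, labels,
                      alpha=1.0, beta=8.0, temperature=4.0):
    """TCKD + NCKD decomposition of the classical KD loss (illustrative weights)."""
    eps = 1e-12
    s_prob = F.softmax(student_logits / temperature, dim=1)
    t_prob = F.softmax(teacher_logits / temperature, dim=1)
    target = F.one_hot(labels, num_classes=s_prob.size(1)).bool()

    def kl(p, q):
        # KL(p || q) per sample, averaged over the batch; zero entries contribute zero.
        return (p * (p.clamp_min(eps).log() - q.clamp_min(eps).log())).sum(dim=1).mean()

    # TCKD: binary KL between (target, non-target) probability masses.
    t_bin = torch.stack([(t_prob * target).sum(1), (t_prob * ~target).sum(1)], dim=1)
    s_bin = torch.stack([(s_prob * target).sum(1), (s_prob * ~target).sum(1)], dim=1)
    tckd = kl(t_bin, s_bin)

    # NCKD: KL over the non-target classes, renormalized within that group.
    t_nt = (t_prob * ~target) / t_bin[:, 1:].clamp_min(eps)
    s_nt = (s_prob * ~target) / s_bin[:, 1:].clamp_min(eps)
    nckd = kl(t_nt, s_nt)

    return (alpha * tckd + beta * nckd) * temperature ** 2
```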
arXiv Detail & Related papers (2022-03-16T15:07:47Z) - Orderly Dual-Teacher Knowledge Distillation for Lightweight Human Pose
Estimation [1.0323063834827415]
We propose an orderly dual-teacher knowledge distillation (ODKD) framework, which consists of two teachers with different capabilities.
Taking the two teachers together, an orderly learning strategy is proposed to promote the student's ability to absorb knowledge.
Our proposed ODKD can improve the performance of different lightweight models by a large margin, and HRNet-W16 equipped with ODKD achieves state-of-the-art performance for lightweight human pose estimation.
arXiv Detail & Related papers (2021-04-21T08:50:36Z) - Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z) - Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)