Triplet Knowledge Distillation
- URL: http://arxiv.org/abs/2305.15975v1
- Date: Thu, 25 May 2023 12:12:31 GMT
- Title: Triplet Knowledge Distillation
- Authors: Xijun Wang, Dongyang Liu, Meina Kan, Chunrui Han, Zhongqin Wu,
Shiguang Shan
- Abstract summary: In Knowledge Distillation, the teacher is generally much larger than the student, making the solution of the teacher likely to be difficult for the student to learn.
To ease the mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD.
- Score: 73.39109022280878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Knowledge Distillation, the teacher is generally much larger than the
student, making the solution of the teacher likely to be difficult for the
student to learn. To ease the mimicking difficulty, we introduce a triplet
knowledge distillation mechanism named TriKD. Besides teacher and student,
TriKD employs a third role called anchor model. Before distillation begins, the
pre-trained anchor model delimits a subspace within the full solution space of
the target problem. Solutions within the subspace are expected to be easy
targets that the student could mimic well. Distillation then begins in an
online manner, and the teacher is only allowed to express solutions within the
aforementioned subspace. Surprisingly, benefiting from accurate but
easy-to-mimic hints, the student can finally perform well. After the student is
well trained, it can be used as the new anchor for new students, forming a
curriculum learning strategy. Our experiments on image classification and face
recognition with various models clearly demonstrate the effectiveness of our
method. Furthermore, the proposed TriKD is also effective in dealing with the
overfitting issue. Moreover, our theoretical analysis supports the rationality
of our triplet distillation.
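To make the mechanism concrete, here is a minimal PyTorch-style sketch of how a triplet objective of this kind might be wired up. It is an interpretation of the abstract, not the paper's implementation: the pre-trained anchor is frozen, the teacher is regularized toward the anchor (so it only expresses solutions near the anchor's subspace), and the student is trained online against both. The function names, loss composition, and weights (kd_loss, alpha, beta) are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard soft-target KL divergence with temperature scaling."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def trikd_step(student, teacher, anchor, x, y,
               opt_s, opt_t, alpha=1.0, beta=1.0):
    """One online TriKD-style update (sketch, weights are illustrative).

    The frozen anchor delimits the easy-to-mimic subspace: both teacher
    and student are pulled toward its outputs, while the student
    additionally mimics the (anchor-constrained) teacher.
    """
    with torch.no_grad():
        a_out = anchor(x)            # anchor is pre-trained and frozen

    s_out = student(x)
    t_out = teacher(x)

    # Teacher: task loss + constraint to stay near the anchor's solutions.
    loss_t = F.cross_entropy(t_out, y) + beta * kd_loss(t_out, a_out)
    # Student: task loss + mimic the teacher + anchor hint.
    loss_s = (F.cross_entropy(s_out, y)
              + alpha * kd_loss(s_out, t_out.detach())
              + beta * kd_loss(s_out, a_out))

    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    return loss_s.item(), loss_t.item()
```
Once the student is well trained, it can replace the anchor for the next, possibly smaller, student, which is the curriculum strategy the abstract describes.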
Related papers
- Teaching What You Should Teach: A Data-Based Distillation Method [20.595460553747163]
We introduce the "Teaching what you Should Teach" strategy into a knowledge distillation framework.
We propose a data-based distillation method named "TST" that searches for desirable augmented samples to assist in distilling more efficiently and rationally.
To be specific, we design a neural network-based data augmentation module with a prior bias, which helps find samples that match the teacher's strengths but expose the student's weaknesses.
arXiv Detail & Related papers (2022-12-11T06:22:14Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher [5.010360359434596]
The Pro-KD technique defines a smoother training path for the student by following the training footprints of the teacher.
We demonstrate that our technique is quite effective in mitigating the capacity-gap problem and the checkpoint-search problem.
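One plausible reading of "following the training footprints of the teacher" is to distill from progressively later teacher checkpoints rather than only from the fully converged teacher. The sketch below illustrates that reading only; the checkpoint format, schedule, temperature, and loss weighting (alpha) are assumptions, not taken from the paper.
```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence on soft targets."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T

def progressive_distill(student, teacher, checkpoint_paths, loader, optimizer,
                        epochs_per_stage=1, alpha=0.5):
    """Distill from progressively later (stronger) teacher checkpoints,
    so the student follows the teacher's own training trajectory.
    `checkpoint_paths` is assumed to hold state dicts ordered early-to-final."""
    for ckpt in checkpoint_paths:
        teacher.load_state_dict(torch.load(ckpt))
        teacher.eval()
        for _ in range(epochs_per_stage):
            for x, y in loader:
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                loss = ((1 - alpha) * F.cross_entropy(s_logits, y)
                        + alpha * soft_target_loss(s_logits, t_logits))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```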
arXiv Detail & Related papers (2021-10-16T09:49:43Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a particular student.
Joint optimization for both teacher and student is achieved by two iterative steps.
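A rough sketch of what such two iterative steps could look like is given below: the teacher is first updated so that its soft targets reduce this student's mimicking error, then the student is updated on the refreshed soft targets. The concrete losses and the per-batch alternation are illustrative assumptions, not the paper's exact procedure.
```python
import torch
import torch.nn.functional as F

def ikd_round(teacher, student, x, y, opt_teacher, opt_student, T=3.0):
    """One round of alternating teacher/student updates (illustrative sketch)."""
    # Step 1: update the teacher so its soft targets suit this student.
    # Teacher loss = its own task loss + the fixed student's mimicking error.
    t_logits = teacher(x)
    with torch.no_grad():
        s_logits = student(x)
    p_t = F.softmax(t_logits / T, dim=1)
    log_p_s = F.log_softmax(s_logits / T, dim=1)
    mimic_err = (p_t * (p_t.clamp_min(1e-8).log() - log_p_s)).sum(dim=1).mean()
    teacher_loss = F.cross_entropy(t_logits, y) + (T * T) * mimic_err
    opt_teacher.zero_grad(); teacher_loss.backward(); opt_teacher.step()

    # Step 2: update the student on the freshly generated soft targets.
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    student_loss = F.cross_entropy(s_logits, y) + kd
    opt_student.zero_grad(); student_loss.backward(); opt_student.step()
    return teacher_loss.item(), student_loss.item()
```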
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- ALP-KD: Attention-Based Layer Projection for Knowledge Distillation [30.896957367331137]
Two neural networks, namely a teacher and a student, are coupled together during training.
The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions.
In such a setting, distillation only happens at the final predictions, whereas the student could also benefit from the teacher's supervision of internal components.
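In the spirit of extending supervision beyond final predictions, the sketch below matches each student hidden state to an attention-weighted combination of all teacher hidden states, so the student is not tied to one fixed teacher layer. It is only an illustration of the layer-projection idea (assuming, for simplicity, equal hidden dimensions for teacher and student); it is not claimed to be the exact ALP-KD formulation.
```python
import torch
import torch.nn.functional as F

def layer_projection_loss(student_states, teacher_states):
    """Supervise each student hidden state with an attention-weighted
    combination of ALL teacher hidden states, instead of distilling
    only the final predictions. Each state has shape [batch, dim].
    Illustrative sketch; assumes matching hidden dimensions."""
    t_stack = torch.stack(teacher_states, dim=1)            # [B, L_t, D]
    loss = 0.0
    for s in student_states:                                 # each [B, D]
        scores = torch.einsum("bd,bld->bl", s, t_stack)      # similarity to each teacher layer
        attn = F.softmax(scores / s.size(-1) ** 0.5, dim=1)  # attention over teacher layers
        target = torch.einsum("bl,bld->bd", attn, t_stack)   # attended teacher state
        loss = loss + F.mse_loss(s, target.detach())
    return loss / len(student_states)
```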
arXiv Detail & Related papers (2020-12-27T22:30:13Z)
- Reducing the Teacher-Student Gap via Spherical Knowledge Distillation [67.75526580926149]
Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity-gap problem by studying the confidence gap between teacher and student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it.
arXiv Detail & Related papers (2020-10-15T03:03:36Z)
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the cost of high computational complexity and are hard to deploy on low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.