Locally Linear Region Knowledge Distillation
- URL: http://arxiv.org/abs/2010.04812v2
- Date: Mon, 19 Oct 2020 08:47:58 GMT
- Title: Locally Linear Region Knowledge Distillation
- Authors: Xiang Deng and Zhongfei (Mark) Zhang
- Abstract summary: Knowledge distillation (KD) is an effective technique to transfer knowledge from one neural network (teacher) to another (student).
We argue that transferring knowledge at sparse training data points cannot enable the student to well capture the local shape of the teacher function.
We propose locally linear region knowledge distillation ($\rm L^2$RKD) which transfers the knowledge in local, linear regions from a teacher to a student.
- Score: 5.6592403195043826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is an effective technique to transfer knowledge
from one neural network (teacher) to another (student), thus improving the
performance of the student. To make the student better mimic the behavior of
the teacher, the existing work focuses on designing different criteria to align
their logits or representations. Different from these efforts, we address
knowledge distillation from a novel data perspective. We argue that
transferring knowledge at sparse training data points cannot enable the student
to well capture the local shape of the teacher function. To address this issue,
we propose locally linear region knowledge distillation ($\rm L^2$RKD) which
transfers the knowledge in local, linear regions from a teacher to a student.
This is achieved by enforcing the student to mimic the outputs of the teacher
function in local, linear regions. In this way, the student better captures the
local shape of the teacher function and thus achieves better performance.
Despite its simplicity, extensive experiments demonstrate that $\rm L^2$RKD is
superior to the original KD in many respects: it outperforms KD and other
state-of-the-art approaches by a large margin, remains robust and effective
under few-shot settings, and is compatible with existing distillation
approaches, further improving their performance significantly.
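Since the abstract describes the mechanism only at a high level, the following is a minimal sketch of how such a loss could look in PyTorch. It assumes that a "local, linear region" is approximated by points sampled on line segments between pairs of training inputs, and that the student matches the teacher's temperature-softened outputs at those sampled points as well as at the original ones. The function names, the interpolation scheme, and the weighting are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def soft_kl(student_logits, teacher_logits, temperature):
    """Standard softened-output matching term (Hinton-style KD)."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2


def l2rkd_loss(student, teacher, x, temperature=4.0, num_samples=4):
    """Sketch: match teacher outputs at the training points and at points
    sampled inside local regions, here approximated by convex combinations
    of each input with a randomly paired input (an assumption)."""
    with torch.no_grad():
        t_logits = teacher(x)
    loss = soft_kl(student(x), t_logits, temperature)

    region_loss = 0.0
    for _ in range(num_samples):
        # Random interpolation coefficients, broadcast over non-batch dims.
        lam = torch.rand(x.size(0), *([1] * (x.dim() - 1)), device=x.device)
        x_mix = lam * x + (1.0 - lam) * x[torch.randperm(x.size(0), device=x.device)]
        with torch.no_grad():
            t_mix = teacher(x_mix)
        region_loss = region_loss + soft_kl(student(x_mix), t_mix, temperature)

    return loss + region_loss / num_samples
```

In practice, this term would be added to the usual cross-entropy loss on the ground-truth labels, as in standard KD.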
Related papers
- Improving Knowledge Distillation with Teacher's Explanation [14.935696904019146]
We introduce a novel Knowledge Explaining Distillation (KED) framework.
KED allows the student to learn not only from the teacher's predictions but also from the teacher's explanations.
Our experiments over a variety of datasets show that KED students can substantially outperform KD students of similar complexity.
arXiv Detail & Related papers (2023-10-04T04:18:01Z) - Cross Architecture Distillation for Face Recognition [49.55061794917994]
We develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge.
Experiments on popular face benchmarks and two large-scale verification sets demonstrate the superiority of our method.
arXiv Detail & Related papers (2023-06-26T12:54:28Z) - Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features (a generic sketch of these standard alignment terms appears after this list).
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
arXiv Detail & Related papers (2023-05-26T15:05:19Z) - On effects of Knowledge Distillation on Transfer Learning [0.0]
We propose a machine learning architecture we call TL+KD that combines knowledge distillation with transfer learning.
We show that, by using guidance and knowledge from a larger teacher network during fine-tuning, we can improve the student network to achieve better validation performance, such as higher accuracy.
arXiv Detail & Related papers (2022-10-18T08:11:52Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link teacher's task and student's task by optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
arXiv Detail & Related papers (2022-04-25T09:34:37Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Distilling Knowledge by Mimicking Features [32.79431807764681]
We argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer.
Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer.
arXiv Detail & Related papers (2020-11-03T02:15:14Z) - Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z) - Inter-Region Affinity Distillation for Road Marking Segmentation [81.3619453527367]
We study the problem of distilling knowledge from a large deep teacher network to a much smaller student network.
Our method is known as Inter-Region Affinity KD (IntRA-KD).
arXiv Detail & Related papers (2020-04-11T04:26:37Z)
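For reference, the entry on "Improving Knowledge Distillation via Regularizing Feature Norm and Direction" above names the two alignment terms used by prevailing KD methods: KL divergence between logits and L2 distance between intermediate features. The snippet below is a generic sketch of those two terms only, not that paper's regularizers; the assumption that student and teacher features have already been projected to matching shapes is illustrative.

```python
import torch.nn.functional as F


def standard_kd_terms(s_logits, t_logits, s_feat, t_feat, temperature=4.0):
    """Generic KD alignment terms: KL divergence between temperature-softened
    logits, and L2 distance between intermediate features (assumes the
    features were already projected to matching shapes, e.g., by a 1x1 conv
    or linear adapter)."""
    logit_kl = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=1),
        F.softmax(t_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    feature_l2 = F.mse_loss(s_feat, t_feat)
    return logit_kl, feature_l2
```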
This list is automatically generated from the titles and abstracts of the papers on this site.