Embedding Compression for Teacher-to-Student Knowledge Transfer
- URL: http://arxiv.org/abs/2402.06761v1
- Date: Fri, 9 Feb 2024 19:47:31 GMT
- Title: Embedding Compression for Teacher-to-Student Knowledge Transfer
- Authors: Yiwei Ding and Alexander Lerch
- Abstract summary: We propose to use an embedding compression module with a trainable teacher transformation to obtain a compact teacher embedding.
Results show that adding the embedding compression module improves the classification performance.
Student models trained with the guidance of embeddings show stronger generalizability.
- Score: 58.01044612920816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Common knowledge distillation methods require the teacher model and the student model to be trained on the same task. However, using embeddings as teachers has also been proposed for settings in which the source task and the target task differ. Prior work that uses embeddings as teachers ignores the fact that the teacher embeddings are likely to contain knowledge that is irrelevant to the target task. To address this problem, we propose an embedding compression module with a trainable teacher transformation that produces a compact teacher embedding. Results show that adding the embedding compression module improves classification performance, especially for unsupervised teacher embeddings. Moreover, student models trained with the guidance of embeddings show stronger generalizability.
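Below is a minimal PyTorch sketch of the idea described in the abstract: a trainable linear transformation compresses the teacher embedding, and the student is trained with its task loss plus a distance loss toward that compact embedding. The module names, dimensions, MSE distance, and loss weight are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingCompression(nn.Module):
    """Trainable teacher transformation: maps a teacher embedding to a compact one.

    Dimensions are placeholders, not the configuration used in the paper.
    """

    def __init__(self, teacher_dim: int = 2048, compact_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(teacher_dim, compact_dim)

    def forward(self, teacher_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(teacher_emb)


def distillation_step(student, compressor, inputs, labels, teacher_emb,
                      optimizer, alpha: float = 0.5):
    """One training step: task loss plus a distance loss to the compact teacher embedding.

    Assumes `student(inputs)` returns (logits, student_embedding) and that the
    student embedding has the same dimensionality as the compressed teacher embedding.
    """
    optimizer.zero_grad()
    logits, student_emb = student(inputs)
    compact_teacher = compressor(teacher_emb)      # compact teacher embedding
    task_loss = F.cross_entropy(logits, labels)
    match_loss = F.mse_loss(student_emb, compact_teacher)
    loss = task_loss + alpha * match_loss
    loss.backward()                                # gradients reach both student and compressor
    optimizer.step()                               # optimizer should hold both parameter sets
    return loss.item()
```

The key design point sketched here is that the compression (the "teacher transformation") is trained jointly with the student, so the compact embedding can drop teacher knowledge that is irrelevant to the target task.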
Related papers
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by using larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z) - Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link the teacher's task and the student's task via optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
arXiv Detail & Related papers (2022-04-25T09:34:37Z) - Knowledge Distillation with the Reused Teacher Classifier [31.22117343316628]
We show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap.
Our technique achieves state-of-the-art results at the modest cost of a slightly lower compression ratio due to the added projector.
arXiv Detail & Related papers (2022-03-26T06:28:46Z) - Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach for training a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.