Cosine Similarity Knowledge Distillation for Individual Class
Information Transfer
- URL: http://arxiv.org/abs/2311.14307v1
- Date: Fri, 24 Nov 2023 06:34:47 GMT
- Title: Cosine Similarity Knowledge Distillation for Individual Class
Information Transfer
- Authors: Gyeongdo Ham, Seonghak Kim, Suin Lee, Jae-Hyeok Lee, and Daeshik Kim
- Abstract summary: We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher model's performance.
We use cosine similarity, a technique in Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance.
- Score: 11.544799404018473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous logits-based Knowledge Distillation (KD) methods have utilized predictions
about multiple categories within each sample (i.e., class predictions) and have
employed Kullback-Leibler (KL) divergence to reduce the discrepancy between the
student and teacher predictions. Despite the proliferation of KD techniques,
the student model continues to fall short of the teacher's performance. In
response, we introduce a novel and effective KD method capable of achieving
results on par with or superior to the teacher model's performance. We
utilize teacher and student predictions about multiple samples for each
category (i.e., batch predictions) and apply cosine similarity, a commonly used
technique in Natural Language Processing (NLP) for measuring the resemblance
between text embeddings. This metric's inherent scale-invariance property,
which relies solely on vector direction and not magnitude, allows the student
to dynamically learn from the teacher's knowledge, rather than being bound by a
fixed distribution of the teacher's knowledge. Furthermore, we propose a method
called cosine similarity weighted temperature (CSWT) to improve the
performance. CSWT reduces the temperature scaling in KD when the cosine
similarity between the student and teacher models is high, and conversely, it
increases the temperature scaling when the cosine similarity is low. This
adjustment optimizes the transfer of information from the teacher to the
student model. Extensive experimental results show that our proposed method
serves as a viable alternative to existing methods. We anticipate that this
approach will offer valuable insights for future research on model compression.
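The abstract describes two concrete mechanisms: a scale-invariant cosine-similarity objective over per-class batch predictions, and the CSWT temperature rule. Below is a minimal PyTorch sketch of both; the function names, the per-sample placement of the temperature, and the linear mapping from similarity to temperature (with assumed bounds tau_min/tau_max) are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def cosine_batch_loss(student_logits, teacher_logits):
    # Softmax over classes, then compare *columns*: the predictions that
    # every sample in the batch makes for one category ("batch predictions").
    s = F.softmax(student_logits, dim=1)              # (B, C)
    t = F.softmax(teacher_logits, dim=1)              # (B, C)
    cos = F.cosine_similarity(s.t(), t.t(), dim=1)    # (C,) one value per class
    # Cosine similarity is scale-invariant: only the direction of each
    # class column matters, not its magnitude.
    return (1.0 - cos).mean()

def cswt_kl_loss(student_logits, teacher_logits, tau_min=2.0, tau_max=6.0):
    s = F.softmax(student_logits, dim=1)
    t = F.softmax(teacher_logits, dim=1)
    cos = F.cosine_similarity(s, t, dim=1)            # (B,) per-sample agreement
    # High agreement -> lower temperature, low agreement -> higher temperature
    # (assumed linear mapping; the abstract states only the direction).
    tau = (tau_max - (tau_max - tau_min) * cos.clamp(0.0, 1.0)).unsqueeze(1)
    s_log = F.log_softmax(student_logits / tau, dim=1)
    t_soft = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(s_log, t_soft, reduction="batchmean")
```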
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
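A hypothetical sketch of the interleaved sampling loop described above; the model call signature and the rank-based acceptance threshold (`top_k`) are assumptions, since the summary only says that poorly ranked tokens are replaced.

```python
import torch

@torch.no_grad()
def skd_generate(student, teacher, input_ids, max_new_tokens=32, top_k=25):
    # `student` and `teacher` are assumed to map a token-id tensor of shape
    # (B, T) to next-token logits of shape (B, T, V).
    ids = input_ids
    for _ in range(max_new_tokens):
        # Student proposes the next token from its own distribution.
        s_logits = student(ids)[:, -1, :]
        proposal = torch.multinomial(torch.softmax(s_logits, dim=-1), 1)
        # Teacher ranks the proposal under its own distribution.
        t_logits = teacher(ids)[:, -1, :]
        rank = (t_logits > t_logits.gather(-1, proposal)).sum(-1, keepdim=True)
        # Poorly ranked proposals are replaced by a teacher sample.
        t_sample = torch.multinomial(torch.softmax(t_logits, dim=-1), 1)
        next_id = torch.where(rank < top_k, proposal, t_sample)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids  # on-the-fly training data for distillation
```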
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective [48.99488315273868]
We present a contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints.
Our method minimizes logit differences within the same sample, taking their exact numerical values into account.
We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO.
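A sketch of what a sample-wise contrastive objective with intra- and inter-sample constraints can look like; the exact CKD formulation may differ, so treat this as an illustration of the idea rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_logits, teacher_logits, temperature=0.1):
    # Inter-sample constraint (InfoNCE-style): each student logit vector
    # should be most similar to the teacher logits of the *same* sample
    # and dissimilar to teacher logits of other samples in the batch.
    s = F.normalize(student_logits, dim=1)        # (B, C)
    t = F.normalize(teacher_logits, dim=1)        # (B, C)
    sim = s @ t.t() / temperature                 # (B, B) pairwise similarity
    labels = torch.arange(s.size(0), device=s.device)
    nce = F.cross_entropy(sim, labels)            # diagonal = positives
    # Intra-sample constraint: keep the actual numerical values close,
    # not only the direction (the summary stresses "numerical values").
    mse = F.mse_loss(student_logits, teacher_logits)
    return nce + mse
```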
arXiv Detail & Related papers (2024-04-22T11:52:40Z) - Improving Knowledge Distillation via Regularizing Feature Norm and
Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
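The two prevailing alignment objectives the summary names, KL divergence between softened logits and L2 distance between intermediate features, in a compact sketch (the tau**2 rescaling is the standard correction for temperature-softened gradients):

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, tau=4.0):
    # Standard logit distillation: KL divergence between softened outputs.
    s_log = F.log_softmax(student_logits / tau, dim=1)
    t_soft = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(s_log, t_soft, reduction="batchmean") * tau ** 2

def feature_kd_loss(student_feat, teacher_feat):
    # Standard feature distillation: L2 distance between intermediate
    # features (a projection is usually needed when dimensions differ;
    # omitted here for brevity).
    return F.mse_loss(student_feat, teacher_feat)
```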
arXiv Detail & Related papers (2023-05-26T15:05:19Z) - Improved knowledge distillation by utilizing backward pass knowledge in
neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
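The summary leaves the construction open; one plausible reading, sketched below, perturbs continuous inputs (or embeddings, in the NLP case) along the teacher's input gradient to create auxiliary samples. The epsilon step and the choice of the top logit are assumptions.

```python
import torch

def auxiliary_samples(teacher, x, eps=0.01):
    # Extract knowledge from the teacher's backward pass: back-propagate
    # the top predicted logit to the input and step against the gradient,
    # FGSM-style, to obtain samples closer to the decision boundary.
    x = x.clone().requires_grad_(True)
    logits = teacher(x)
    top = logits.max(dim=1).values.sum()
    grad, = torch.autograd.grad(top, x)
    return (x - eps * grad.sign()).detach()
```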
arXiv Detail & Related papers (2023-01-27T22:07:38Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Knowledge Distillation from A Stronger Teacher [44.11781464210916]
This paper presents a method dubbed DIST to distill better from a stronger teacher.
We empirically find that the discrepancy between the predictions of the student and those of a stronger teacher tends to be more severe.
Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures.
arXiv Detail & Related papers (2022-05-21T08:30:58Z) - Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Experiments show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z) - Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
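A minimal sketch of the auxiliary-classifier idea: attach small heads to the student's intermediate feature maps and distill the teacher's softened predictions into each head. The head architecture and the shared teacher target are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxHead(nn.Module):
    # Auxiliary classifier at an intermediate layer: pool the feature map
    # and classify, so intermediate features can be matched to the teacher
    # in logit space rather than raw feature space.
    def __init__(self, channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat):                      # feat: (B, C, H, W)
        return self.fc(self.pool(feat).flatten(1))

def multi_head_kd_loss(student_feats, teacher_logits, heads, tau=4.0):
    # Sum the usual softened-KL term over every auxiliary head.
    t_soft = F.softmax(teacher_logits / tau, dim=1)
    loss = 0.0
    for feat, head in zip(student_feats, heads):
        s_log = F.log_softmax(head(feat) / tau, dim=1)
        loss = loss + F.kl_div(s_log, t_soft, reduction="batchmean") * tau ** 2
    return loss
```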
arXiv Detail & Related papers (2020-12-05T00:49:14Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
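The summary does not spell out the mechanism; the name suggests mixup-style interpolation, so the sketch below distills the teacher on mixed inputs under that assumption alone.

```python
import torch
import torch.nn.functional as F

def mixup_kd_loss(student, teacher, x_a, x_b, alpha=0.4):
    # Generic mixup-style distillation sketch: interpolate two inputs (or
    # their embeddings, for discrete text) and match the student to the
    # teacher on the mixed point. Assumed from the name "MixKD".
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_a + (1.0 - lam) * x_b
    with torch.no_grad():
        t_soft = F.softmax(teacher(x_mix), dim=1)
    s_log = F.log_softmax(student(x_mix), dim=1)
    return F.kl_div(s_log, t_soft, reduction="batchmean")
```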
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard approach of simply enforcing that the mean and variance of the student's features match those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
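A sketch of the base idea of statistics transfer, matching per-channel feature mean and standard deviation; the summary states the method goes beyond this, so the sketch covers only the starting point.

```python
import torch
import torch.nn.functional as F

def stat_transfer_loss(student_feat, teacher_feat, eps=1e-5):
    # (B, C, H, W) -> per-sample, per-channel statistics over spatial dims,
    # in the spirit of adaptive instance normalization.
    s_mean = student_feat.mean(dim=(2, 3))
    t_mean = teacher_feat.mean(dim=(2, 3))
    s_std = (student_feat.var(dim=(2, 3)) + eps).sqrt()
    t_std = (teacher_feat.var(dim=(2, 3)) + eps).sqrt()
    return F.mse_loss(s_mean, t_mean) + F.mse_loss(s_std, t_std)
```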
arXiv Detail & Related papers (2020-03-09T17:50:12Z)