Leveraging Angular Distributions for Improved Knowledge Distillation
- URL: http://arxiv.org/abs/2302.14130v1
- Date: Mon, 27 Feb 2023 20:34:30 GMT
- Title: Leveraging Angular Distributions for Improved Knowledge Distillation
- Authors: Eun Som Jeon, Hongjun Choi, Ankita Shukla, Pavan Turaga
- Abstract summary: In computer vision applications, the feature activations learned by a higher-capacity model contain richer knowledge, highlighting complete objects while focusing less on the background.
We propose a new loss function for distillation, called angular margin-based distillation (AMD) loss.
We show that the proposed method has advantages in compatibility with other learning techniques, such as using fine-grained features, augmentation, and other distillation methods.
- Score: 4.751886527142779
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation as a broad class of methods has led to the development
of lightweight and memory-efficient models, using a pre-trained model with a
large capacity (teacher network) to train a smaller model (student network).
Recently, additional variations for knowledge distillation, utilizing
activation maps of intermediate layers as the source of knowledge, have been
studied. Generally, in computer vision applications, the feature activations
learned by a higher-capacity model contain richer knowledge, highlighting
complete objects while focusing less on the background.
Based on this observation, we leverage the dual ability of the teacher to
accurately distinguish between positive (relevant to the target object) and
negative (irrelevant) areas. We propose a new loss function for distillation,
called angular margin-based distillation (AMD) loss. AMD loss uses the angular
distance between positive and negative features by projecting them onto a
hypersphere, motivated by the near angular distributions seen in many feature
extractors. Then, we create a more attentive feature that is angularly
distributed on the hypersphere by introducing an angular margin to the positive
feature. Transferring such knowledge from the teacher network enables the
student model to harness the teacher's higher discrimination of positive and
negative features, thus distilling superior student models. The proposed
method is evaluated for various student-teacher network pairs on four public
datasets. Furthermore, we show that the proposed method has advantages in
compatibility with other learning techniques, such as using fine-grained
features, augmentation, and other distillation methods.
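As a rough illustration of the idea described in the abstract, below is a minimal PyTorch sketch of an angular-margin-style feature distillation term: features are split into positive and negative regions, projected onto the unit hypersphere, and the student is pushed to separate them by at least an angular margin beyond the teacher's separation. This is not the authors' AMD implementation; the function name, tensor shapes, the teacher-derived relevance mask, and the margin value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def angular_margin_distill_loss(teacher_feat, student_feat, teacher_mask, margin=0.5):
    """Sketch of an angular-margin-style feature distillation term (illustrative,
    not the paper's exact AMD loss). `teacher_feat` and `student_feat` are
    (batch, channels, H, W) activation maps; `teacher_mask` is a (batch, 1, H, W)
    map in [0, 1] marking positive (object-relevant) regions per the teacher."""
    # Split each activation map into positive and negative parts using the
    # teacher's notion of relevant vs. irrelevant regions.
    t_pos = (teacher_feat * teacher_mask).flatten(1)
    t_neg = (teacher_feat * (1.0 - teacher_mask)).flatten(1)
    s_pos = (student_feat * teacher_mask).flatten(1)
    s_neg = (student_feat * (1.0 - teacher_mask)).flatten(1)

    # Project features onto the unit hypersphere so only angles matter.
    t_pos, t_neg = F.normalize(t_pos, dim=1), F.normalize(t_neg, dim=1)
    s_pos, s_neg = F.normalize(s_pos, dim=1), F.normalize(s_neg, dim=1)

    # Angle between positive and negative features for teacher and student.
    t_angle = torch.acos((t_pos * t_neg).sum(dim=1).clamp(-1 + 1e-7, 1 - 1e-7))
    s_angle = torch.acos((s_pos * s_neg).sum(dim=1).clamp(-1 + 1e-7, 1 - 1e-7))

    # Encourage the student to separate positive and negative regions by at
    # least `margin` more than the teacher's own angular separation.
    return F.relu(t_angle + margin - s_angle).mean()
```

In practice, a feature-level term like this would typically be added, with a weighting coefficient, to the usual task loss and logit-level distillation loss.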
Related papers
- Linear Projections of Teacher Embeddings for Few-Class Distillation [14.99228980898161]
Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model.
We introduce a novel method for distilling knowledge from the teacher's model representations, which we term Learning Embedding Linear Projections (LELP).
Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems.
arXiv Detail & Related papers (2024-09-30T16:07:34Z)
- Knowledge Distillation for Road Detection based on cross-model Semi-Supervised Learning [17.690698736544626]
We propose an integrated approach that combines knowledge distillation and semi-supervised learning methods.
This hybrid approach leverages the robust capabilities of large models to effectively utilise large unlabelled data.
The proposed semi-supervised learning-based knowledge distillation (SSLKD) approach demonstrates a notable improvement in the performance of the student model.
arXiv Detail & Related papers (2024-02-07T22:50:47Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD that explicitly denoises and matches features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
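As a loose illustration of the "relative geometry" idea in the EmbedDistill summary above, the following is a minimal PyTorch sketch that matches the student's in-batch query-document similarity structure to the teacher's. It assumes a dual-encoder teacher that exposes separate query and document embeddings; the function name, the in-batch-negatives setup, and the temperature are illustrative assumptions, not the paper's actual objective.

```python
import torch.nn.functional as F

def geometry_distill_loss(t_query, t_doc, s_query, s_doc, T=1.0):
    """Sketch of relative-geometry distillation for retrieval (illustrative).
    Embeddings are (batch, dim); query i is paired with document i, and the
    other in-batch documents act as negatives."""
    # Teacher and student query-document similarity matrices (batch x batch),
    # computed on the unit hypersphere so only relative geometry matters.
    t_sim = F.normalize(t_query, dim=1) @ F.normalize(t_doc, dim=1).T
    s_sim = F.normalize(s_query, dim=1) @ F.normalize(s_doc, dim=1).T

    # Align the student's per-query ranking of documents with the teacher's
    # by matching the row-wise softened similarity distributions.
    return F.kl_div(
        F.log_softmax(s_sim / T, dim=1),
        F.softmax(t_sim / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
```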
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z)
- Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Self-knowledge distillation (SKD) aims at transferring the knowledge from a large teacher model to a small student model.
In this paper, we propose a novel SKD method that differs from mainstream methods.
Experiments and ablation studies show its great effectiveness and generalization on various kinds of tasks.
arXiv Detail & Related papers (2021-12-04T02:25:46Z)
- Visualizing the embedding space to explain the effect of knowledge distillation [5.678337324555035]
Recent research has found that knowledge distillation can be effective in reducing the size of a network.
Despite these advances, it is still relatively unclear why this method works, that is, what the resulting student model does 'better'.
arXiv Detail & Related papers (2021-10-09T07:04:26Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
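For context, the feature- and embedding-level methods listed above are typically combined with, or compared against, the classic temperature-scaled logit-matching objective of Hinton et al. Below is a minimal PyTorch sketch of that baseline; the temperature T and mixing weight alpha are illustrative values, not taken from any of the papers above.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard temperature-scaled knowledge distillation loss:
    soft KL term on teacher logits + hard cross-entropy on labels."""
    # Soften both distributions with temperature T; the T**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
    # Ordinary supervised loss on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```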