Multi-head Knowledge Distillation for Model Compression
- URL: http://arxiv.org/abs/2012.02911v1
- Date: Sat, 5 Dec 2020 00:49:14 GMT
- Title: Multi-head Knowledge Distillation for Model Compression
- Authors: Huan Wang, Suhas Lohit, Michael Jones, Yun Fu
- Abstract summary: We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
- Score: 65.58705111863814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several methods of knowledge distillation have been developed for neural
network compression. While they all use the KL divergence loss to align the
soft outputs of the student model more closely with those of the teacher, the
various methods differ in how the intermediate features of the student are
encouraged to match those of the teacher. In this paper, we propose a
simple-to-implement method using auxiliary classifiers at intermediate layers
for matching features, which we refer to as multi-head knowledge distillation
(MHKD). We add loss terms for training the student that measure the
dissimilarity between student and teacher outputs of the auxiliary classifiers.
At the same time, the proposed method also provides a natural way to measure
differences at the intermediate layers even though the dimensions of the
internal teacher and student features may be different. Through several
experiments in image classification on multiple datasets we show that the
proposed method outperforms prior relevant approaches presented in the
literature.
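To make the idea concrete, below is a minimal PyTorch-style sketch of the multi-head setup as described in the abstract: small auxiliary classifier heads are attached to intermediate layers of both teacher and student, and the student is trained with a KL-divergence term that matches the head outputs, on top of the usual soft-output distillation loss. This is an illustrative reading of the abstract rather than the authors' released code; the head architecture, temperature, and loss weights are assumptions.
```python
# Sketch of multi-head knowledge distillation (MHKD) as described in the
# abstract; hyperparameters and head design are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F


class AuxHead(nn.Module):
    """Small auxiliary classifier attached to an intermediate feature map."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feat):                      # feat: (B, C, H, W)
        return self.fc(self.pool(feat).flatten(1))


def kd_kl(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL divergence between soft outputs."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)


def mhkd_loss(s_logits, t_logits, s_feats, t_feats, s_heads, t_heads,
              labels, alpha=0.5, beta=0.5):
    """Cross-entropy + output-level KD + per-head KD at intermediate layers.

    s_feats / t_feats are lists of intermediate feature maps whose shapes may
    differ between student and teacher; the heads map each to class logits so
    they can be compared directly. The teacher-side heads are assumed to have
    been trained beforehand on the frozen teacher.
    """
    loss = F.cross_entropy(s_logits, labels) + alpha * kd_kl(s_logits, t_logits)
    for s_f, t_f, s_h, t_h in zip(s_feats, t_feats, s_heads, t_heads):
        loss = loss + beta * kd_kl(s_h(s_f), t_h(t_f).detach())
    return loss
```
Because each head reduces an intermediate feature map to class logits, student and teacher features can be compared even when their internal dimensions differ, which is the property the abstract points to.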
Related papers
- Preview-based Category Contrastive Learning for Knowledge Distillation [53.551002781828146]
We propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD).
It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers.
It can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories.
arXiv Detail & Related papers (2024-10-18T03:31:00Z)
- Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition [58.41784639847413]
Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals.
In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student.
Results indicate that our proposed method can outperform SOTA PKD methods.
arXiv Detail & Related papers (2024-08-16T22:11:01Z)
- I2CKD: Intra- and Inter-Class Knowledge Distillation for Semantic Segmentation [1.433758865948252]
This paper proposes a new knowledge distillation method tailored for image semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation (I2CKD).
The method focuses on capturing and transferring knowledge between the intermediate layers of the teacher (cumbersome model) and the student (compact model).
arXiv Detail & Related papers (2024-03-27T12:05:22Z)
- AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation [12.92102548320001]
This paper proposes a novel knowledge distillation method called Inter-Class Similarity Distillation (ICSD).
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method, dubbed DiffKD, that explicitly denoises and matches features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Knowledge Distillation from A Stronger Teacher [44.11781464210916]
This paper presents a method dubbed DIST to distill better from a stronger teacher.
We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be fairly severe.
Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures.
arXiv Detail & Related papers (2022-05-21T08:30:58Z)
- Adaptive Distillation: Aggregating Knowledge from Multiple Paths for Efficient Distillation [15.337420940135704]
Knowledge Distillation is becoming one of the primary trends among neural network compression algorithms.
This paper introduces an adaptive distillation approach based on multitask learning methods.
We empirically demonstrate the effectiveness of the proposed approach over other baselines on the applications of knowledge distillation in classification, semantic segmentation, and object detection tasks.
arXiv Detail & Related papers (2021-10-19T00:57:40Z)
- Cross-Layer Distillation with Semantic Calibration [26.59016826651437]
We propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model for each student layer.
With a learned attention distribution, each student layer distills knowledge contained in multiple layers rather than a single fixed intermediate layer from the teacher model for appropriate cross-layer supervision in training.
arXiv Detail & Related papers (2020-12-06T11:16:07Z)
- There and Back Again: Revisiting Backpropagation Saliency Methods [87.40330595283969]
Saliency methods seek to explain the predictions of a model by producing an importance map across each input sample.
A popular class of such methods is based on backpropagating a signal and analyzing the resulting gradient.
We propose a single framework under which several such methods can be unified.
arXiv Detail & Related papers (2020-04-06T17:58:08Z)
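As a point of reference for the saliency entry above, which describes methods that backpropagate a signal and analyze the resulting gradient, here is a minimal vanilla-gradient saliency sketch. It is a generic illustration under assumed conditions (a PyTorch classifier returning logits), not the unified framework proposed in that paper.
```python
# Generic vanilla-gradient saliency sketch (assumed setup; not the unified
# framework of "There and Back Again"): backpropagate a class score to the
# input and read off the gradient magnitude as an importance map.
import torch


def gradient_saliency(model, image, target_class):
    """image: (1, C, H, W) tensor; model: any classifier returning logits.

    Returns an (H, W) map of |d score_target / d input|, max over channels.
    """
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]  # scalar score for the target class
    score.backward()                       # backpropagate the signal to the input
    return image.grad.abs().amax(dim=1)[0]
```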
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.