Cross-Layer Distillation with Semantic Calibration
- URL: http://arxiv.org/abs/2012.03236v1
- Date: Sun, 6 Dec 2020 11:16:07 GMT
- Title: Cross-Layer Distillation with Semantic Calibration
- Authors: Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng,
Chun Chen
- Abstract summary: We propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model for each student layer.
With a learned attention distribution, each student layer distills knowledge contained in multiple layers rather than a single fixed intermediate layer from the teacher model for appropriate cross-layer supervision in training.
- Score: 26.59016826651437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently proposed knowledge distillation approaches based on feature-map
transfer validate that intermediate layers of a teacher model can serve as
effective targets for training a student model to obtain better generalization
ability. Existing studies mainly focus on particular representation forms for
knowledge transfer between manually specified pairs of teacher-student
intermediate layers. However, semantics of intermediate layers may vary in
different networks and manual association of layers might lead to negative
regularization caused by semantic mismatch between certain teacher-student
layer pairs. To address this problem, we propose Semantic Calibration for
Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper
target layers of the teacher model for each student layer with an attention
mechanism. With a learned attention distribution, each student layer distills
knowledge contained in multiple layers rather than a single fixed intermediate
layer from the teacher model for appropriate cross-layer supervision in
training. Consistent improvements over state-of-the-art approaches are observed
in extensive experiments with various network architectures for teacher and
student models, demonstrating the effectiveness and flexibility of the proposed
attention based soft layer association mechanism for cross-layer distillation.
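To make the attention-based soft layer association concrete, here is a minimal PyTorch sketch. The global average pooling of feature maps, the linear query/key projections, the per-pair linear matching projections, and `embed_dim=128` are illustrative simplifications, not the authors' exact implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemCKDSketch(nn.Module):
    """Sketch of attention-based soft layer association for cross-layer KD.

    Every student layer attends over all candidate teacher layers, and the
    feature-matching loss is weighted by that per-instance attention.
    Feature maps are globally pooled to vectors here for brevity.
    """

    def __init__(self, student_dims, teacher_dims, embed_dim=128):
        super().__init__()
        # Query/key projections that parameterize the layer-association attention.
        self.query = nn.ModuleList([nn.Linear(d, embed_dim) for d in student_dims])
        self.key = nn.ModuleList([nn.Linear(d, embed_dim) for d in teacher_dims])
        # One projection per (student layer, teacher layer) pair so the student
        # vector can be matched against each raw teacher vector.
        self.proj = nn.ModuleList([
            nn.ModuleList([nn.Linear(sd, td) for td in teacher_dims])
            for sd in student_dims
        ])
        self.scale = embed_dim ** 0.5

    def forward(self, student_feats, teacher_feats):
        # student_feats / teacher_feats: lists of (B, C, H, W) intermediate maps.
        s_vecs = [f.mean(dim=(2, 3)) for f in student_feats]
        t_vecs = [f.mean(dim=(2, 3)).detach() for f in teacher_feats]  # teacher is frozen

        loss = 0.0
        for i, s in enumerate(s_vecs):
            q = self.query[i](s)                                                  # (B, E)
            keys = torch.stack([self.key[j](t) for j, t in enumerate(t_vecs)], dim=1)
            attn = F.softmax((keys @ q.unsqueeze(-1)).squeeze(-1) / self.scale, dim=1)  # (B, T)
            for j, t in enumerate(t_vecs):
                pair = F.mse_loss(self.proj[i][j](s), t, reduction="none").mean(dim=1)  # (B,)
                loss = loss + (attn[:, j] * pair).mean()
        return loss
```
In training, this term would typically be added to the usual cross-entropy (and any logit-distillation) loss with a weighting coefficient.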
Related papers
- TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant [52.0297393822012]
We introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students.
Within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions.
Our proposed method is evaluated across homogeneous model pairs and arbitrary heterogeneous combinations of CNNs and ViTs, as well as spatial KD methods.
arXiv Detail & Related papers (2024-10-16T08:02:49Z) - Harmonizing knowledge Transfer in Neural Network with Unified Distillation [20.922545937770085]
Knowledge distillation (KD) is known for its ability to transfer knowledge from a cumbersome network (teacher) to a lightweight one (student) without altering the architecture.
This paper introduces a novel perspective by leveraging diverse knowledge sources within a unified KD framework.
arXiv Detail & Related papers (2024-09-27T09:09:45Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
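A rough sketch of how such per-stage (local multi-scale) reconstruction could be wired up in PyTorch, with lower stages supervised by fine-scale targets and upper stages by coarser, downsampled ones. The pooled-pixel targets, 1x1-conv prediction heads, and masking scheme are assumptions for illustration, not the paper's exact design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_multiscale_reconstruction(stage_feats, heads, image, masks):
    """Sum of per-stage masked reconstruction losses.

    stage_feats: list of (B, C_k, H_k, W_k) encoder features, low to high stage.
    heads: per-stage prediction heads, e.g.
        nn.ModuleList(nn.Conv2d(c, 3, kernel_size=1) for c in stage_channels)
    image: (B, 3, H, W) original image.
    masks: list of (B, 1, H_k, W_k) masked-patch indicators at each stage resolution.
    """
    loss = 0.0
    for feat, head, mask in zip(stage_feats, heads, masks):
        # Scale-matched target: fine for high-resolution (lower) stages,
        # coarse for low-resolution (upper) stages.
        target = F.adaptive_avg_pool2d(image, feat.shape[-2:])
        pred = head(feat)
        per_pix = (pred - target).pow(2).mean(dim=1, keepdim=True)  # (B, 1, H_k, W_k)
        loss = loss + (per_pix * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```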
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - Knowledge Distillation from A Stronger Teacher [44.11781464210916]
This paper presents a method dubbed DIST to distill better from a stronger teacher.
We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to become fairly severe.
Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures.
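DIST is commonly described as relaxing exact probability matching into a correlation-based relation match, which is more tolerant of a large student-teacher prediction gap. The sketch below follows that description; the temperature `tau` and the exact inter-/intra-class formulation are assumptions rather than the paper's precise loss.
```python
import torch
import torch.nn.functional as F

def pearson_corr_loss(a, b, eps=1e-8):
    """1 - Pearson correlation, computed row-wise and averaged over rows."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    corr = (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)
    return (1.0 - corr).mean()

def dist_style_loss(student_logits, teacher_logits, tau=4.0):
    """Relation-based distillation: match inter-class (per-sample) and
    intra-class (per-class) correlations instead of exact probabilities."""
    teacher_logits = teacher_logits.detach()
    p_s = F.softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    inter = pearson_corr_loss(p_s, p_t)          # relations across classes, per sample
    intra = pearson_corr_loss(p_s.t(), p_t.t())  # relations across samples, per class
    return inter + intra
```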
arXiv Detail & Related papers (2022-05-21T08:30:58Z) - Weakly Supervised Semantic Segmentation via Alternative Self-Dual
Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model.
We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z) - RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation [24.951887361152988]
We propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are randomly selected to be distilled into the intermediate layers of the student model.
We show that our proposed RAIL-KD approach considerably outperforms other state-of-the-art intermediate-layer KD methods in both performance and training time.
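A minimal sketch of random intermediate-layer mapping, assuming transformer-style hidden states of shape (batch, tokens, dim) and per-layer linear `projections` from the student width to the teacher width; both are illustrative assumptions rather than the paper's exact setup.
```python
import random
import torch.nn.functional as F

def rail_kd_loss(student_feats, teacher_feats, projections, k=None):
    """Randomly pick k teacher intermediate layers (one per student layer by
    default), map them in depth order onto the student layers, and match the
    mean-pooled, projected representations with MSE.

    projections: e.g. [nn.Linear(d_student, d_teacher) for _ in student_feats]
    """
    k = k or len(student_feats)
    chosen = sorted(random.sample(range(len(teacher_feats)), k))  # random teacher layers
    loss = 0.0
    for i, j in enumerate(chosen):
        s = student_feats[i].mean(dim=1)            # (B, D_s), pooled over tokens
        t = teacher_feats[j].mean(dim=1).detach()   # (B, D_t), teacher carries no gradient
        s = F.normalize(projections[i](s), dim=-1)
        t = F.normalize(t, dim=-1)
        loss = loss + F.mse_loss(s, t)
    return loss / k
```
Re-sampling `chosen` every epoch (or every step) is what makes the layer mapping random rather than fixed.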
arXiv Detail & Related papers (2021-09-21T13:21:13Z) - Graph Consistency based Mean-Teaching for Unsupervised Domain Adaptive
Person Re-Identification [54.58165777717885]
This paper proposes a Graph Consistency based Mean-Teaching (GCMT) method that constructs a Graph Consistency Constraint (GCC) between the teacher and student networks.
Experiments on three datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that the proposed GCMT outperforms state-of-the-art methods by a clear margin.
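A hedged sketch of one way such a graph-consistency term could look: batch-level cosine-similarity graphs are built from student and mean-teacher embeddings and matched with MSE. The (B, D) embedding inputs and the MSE graph penalty are assumptions; the paper's exact GCC formulation may differ.
```python
import torch.nn.functional as F

def graph_consistency_loss(student_emb, teacher_emb):
    """Penalize discrepancy between the student's and the (mean-)teacher's
    sample-to-sample similarity graphs over the current batch.

    student_emb, teacher_emb: (B, D) embeddings from the two networks.
    """
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb.detach(), dim=1)   # teacher is not updated by this loss
    graph_s = s @ s.t()                            # (B, B) cosine-similarity graph
    graph_t = t @ t.t()
    return F.mse_loss(graph_s, graph_t)
```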
arXiv Detail & Related papers (2021-05-11T04:09:49Z) - Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
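One plausible reading of "auxiliary classifiers at intermediate layers": attach a small pooled linear head per stage and distill each student head against the teacher's corresponding head (or final logits) with temperature-scaled KL. The head design and temperature below are illustrative assumptions, not the paper's exact architecture.
```python
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryHeads(nn.Module):
    """One auxiliary classifier per intermediate stage."""

    def __init__(self, feat_dims, num_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(d, num_classes))
            for d in feat_dims
        )

    def forward(self, feats):
        # feats: list of (B, C_k, H_k, W_k) intermediate feature maps.
        return [head(f) for head, f in zip(self.heads, feats)]

def multi_head_kd_loss(student_head_logits, teacher_head_logits, tau=4.0):
    """Average temperature-scaled KL between matching student/teacher heads."""
    loss = 0.0
    for s, t in zip(student_head_logits, teacher_head_logits):
        loss = loss + F.kl_div(
            F.log_softmax(s / tau, dim=1),
            F.softmax(t.detach() / tau, dim=1),
            reduction="batchmean",
        ) * tau * tau
    return loss / len(student_head_logits)
```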
arXiv Detail & Related papers (2020-12-05T00:49:14Z) - BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth
Mover's Distance [25.229624487344186]
High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
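A sketch of many-to-many layer mapping via optimal transport. An entropic Sinkhorn iteration stands in for an exact Earth Mover's Distance solver, and equal student/teacher hidden sizes are assumed (otherwise a learned projection would be needed); this is an approximation for illustration, not the paper's exact algorithm.
```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, a, b, eps=0.05, iters=100):
    """Entropic (Sinkhorn) approximation of the optimal transport plan between
    layer-weight vectors a (student) and b (teacher)."""
    K = torch.exp(-cost / eps)                      # (S, T)
    u = torch.ones_like(a)
    for _ in range(iters):
        u = a / (K @ (b / (K.t() @ u)))
    v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # (S, T) transport plan

def layer_emd_loss(student_hiddens, teacher_hiddens):
    """Many-to-many layer mapping: the transport plan decides how much each
    teacher layer supervises each student layer; the loss is the plan-weighted
    sum of pairwise layer distances."""
    # Pool each layer's (B, T, D) hidden states into a single normalized vector.
    s = torch.stack([F.normalize(h.mean(dim=(0, 1)), dim=0) for h in student_hiddens])
    t = torch.stack([F.normalize(h.mean(dim=(0, 1)), dim=0) for h in teacher_hiddens]).detach()
    cost = torch.cdist(s, t, p=2)                   # (S, T) pairwise layer distances
    a = torch.full((s.size(0),), 1.0 / s.size(0), device=s.device)
    b = torch.full((t.size(0),), 1.0 / t.size(0), device=t.device)
    plan = sinkhorn_plan(cost.detach(), a, b)       # the plan itself carries no gradient
    return (plan * cost).sum()
```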
arXiv Detail & Related papers (2020-10-13T02:53:52Z) - Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)