TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant
- URL: http://arxiv.org/abs/2410.12342v1
- Date: Wed, 16 Oct 2024 08:02:49 GMT
- Title: TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant
- Authors: Guopeng Li, Qiang Wang, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia,
- Abstract summary: We introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students.
Within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions.
Our proposed method is evaluated across some homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, spatial KDs.
- Score: 52.0297393822012
- License:
- Abstract: Most knowledge distillation (KD) methodologies predominantly focus on teacher-student pairs with similar architectures, such as both being convolutional neural networks (CNNs). However, the potential and flexibility of KD can be greatly improved by expanding it to novel Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred flexibly to a given student. The primary challenge in CAKD lies in the substantial feature gaps between heterogeneous models, originating from the distinction of their inherent inductive biases and module functions. To this end, we introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students. More importantly, within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions by merging convolution and attention modules derived from both student and teacher module functions. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions in CAKD, hindering the effectiveness of conventional pixel-wise mean squared error (MSE) loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing, thereby improving the feature alignments in CAKD. Our proposed method is evaluated across some homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, achieving state-of-the-art performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code and models will be released.
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures [4.119589507611071]
We propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation.
Specifically, we designe a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models.
We show that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100.
arXiv Detail & Related papers (2024-05-28T18:44:42Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance
and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - One-for-All: Bridge the Gap Between Heterogeneous Architectures in
Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z) - Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD)
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z) - A Lightweight Domain Adversarial Neural Network Based on Knowledge
Distillation for EEG-based Cross-subject Emotion Recognition [8.9104681425275]
Individual differences of Electroencephalogram (EEG) could cause the domain shift which would significantly degrade the performance of cross-subject strategy.
In this work, we propose knowledge distillation (KD) based lightweight DANN to enhance cross-subject EEG-based emotion recognition.
arXiv Detail & Related papers (2023-05-12T13:05:12Z) - Structural Knowledge Distillation for Object Detection [2.690687471762886]
We propose a replacement for the pixel-wise independent lp-norm based on the structural similarity (SSIM)
Our method adds only little computational overhead, is straightforward to implement and at the same time it significantly outperforms the standard lp-norms.
arXiv Detail & Related papers (2022-11-23T17:06:52Z) - Exploring Inter-Channel Correlation for Diversity-preserved
KnowledgeDistillation [91.56643684860062]
Inter-Channel Correlation for Knowledge Distillation(ICKD) is developed.
ICKD captures intrinsic distribution of the featurespace and sufficient diversity properties of features in the teacher network.
We are the first method based on knowl-edge distillation boosts ResNet18 beyond 72% Top-1 ac-curacy on ImageNet classification.
arXiv Detail & Related papers (2022-02-08T07:01:56Z) - Weakly Supervised Semantic Segmentation via Alternative Self-Dual
Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model.
We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.