One-for-All: Bridge the Gap Between Heterogeneous Architectures in
Knowledge Distillation
- URL: http://arxiv.org/abs/2310.19444v1
- Date: Mon, 30 Oct 2023 11:13:02 GMT
- Title: One-for-All: Bridge the Gap Between Heterogeneous Architectures in
Knowledge Distillation
- Authors: Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang,
Chang Xu
- Abstract summary: Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
- Score: 69.65734716679925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has proven to be a highly effective approach for
enhancing model performance through a teacher-student training scheme. However,
most existing distillation methods are designed under the assumption that the
teacher and student models belong to the same model family, particularly the
hint-based approaches. By using centered kernel alignment (CKA) to compare the
learned features between heterogeneous teacher and student models, we observe
significant feature divergence. This divergence illustrates the ineffectiveness
of previous hint-based methods in cross-architecture distillation. To tackle
the challenge in distilling heterogeneous models, we propose a simple yet
effective one-for-all KD framework called OFA-KD, which significantly improves
the distillation performance between heterogeneous architectures. Specifically,
we project intermediate features into an aligned latent space such as the
logits space, where architecture-specific information is discarded.
Additionally, we introduce an adaptive target enhancement scheme to prevent the
student from being disturbed by irrelevant information. Extensive experiments
with various architectures, including CNN, Transformer, and MLP, demonstrate
the superiority of our OFA-KD framework in enabling distillation between
heterogeneous architectures. Specifically, when equipped with our OFA-KD, the
student models achieve notable performance improvements, with a maximum gain of
8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. PyTorch code
and checkpoints can be found at https://github.com/Hao840/OFAKD.
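The two ingredients described above, CKA as the divergence probe and the projection of intermediate features into the logits space, can be made concrete with a short sketch. The snippet below is a minimal illustration and not the released OFAKD code: `linear_cka` implements the standard linear CKA score, while `LogitSpaceProjector` and `projected_kd_loss` show one plausible way to map an intermediate student feature into the logits space and match it against the teacher's logits with a temperature-scaled KL term. The projector shape, pooling choice, and temperature value are assumptions for illustration.

```python
# Hedged sketch, not the authors' implementation.
import torch
import torch.nn.functional as F

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between feature matrices x: (n, d1) and y: (n, d2)."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    num = (y.T @ x).norm(p="fro") ** 2    # ||Y^T X||_F^2
    den = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
    return num / den

class LogitSpaceProjector(torch.nn.Module):
    """Maps an intermediate student feature into the logits space so it can be
    matched against the teacher's logits; dimensions here are assumptions."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.proj = torch.nn.Linear(feat_dim, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Global-average-pool a (B, C, H, W) CNN feature; token-based
        # features (ViT/MLP) would be pooled over the token dimension instead.
        if feat.dim() == 4:
            feat = feat.mean(dim=(2, 3))
        return self.proj(feat)

def projected_kd_loss(student_feat, teacher_logits, projector, tau: float = 4.0):
    """Temperature-scaled KL between projected student features and teacher logits."""
    s_logits = projector(student_feat)
    return F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau
```

The adaptive target enhancement scheme mentioned in the abstract would further modify the teacher target before this matching term is applied; it is omitted from the sketch.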
Related papers
- TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant [52.0297393822012]
We introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students.
Within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions.
Our proposed method is evaluated across some homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and spatial KDs.
arXiv Detail & Related papers (2024-10-16T08:02:49Z)
- Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures [4.119589507611071]
We propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation.
Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models (see the sketch after this list).
We show that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100.
arXiv Detail & Related papers (2024-05-28T18:44:42Z)
- Boosting the Cross-Architecture Generalization of Dataset Distillation through an Empirical Study [52.83643622795387]
The poor cross-architecture generalization of dataset distillation weakens its practical significance.
We propose a novel method of EvaLuation with distillation Feature (ELF).
Extensive experiments show that ELF effectively enhances the cross-architecture generalization of current DD methods.
arXiv Detail & Related papers (2023-12-09T15:41:42Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- DisWOT: Student Architecture Search for Distillation WithOut Training [0.0]
We explore a novel training-free framework to search for the best student architectures for a given teacher.
Our work first empirically shows that the optimal model under vanilla training cannot be the winner in distillation.
Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces.
arXiv Detail & Related papers (2023-03-28T01:58:45Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% FLOPs of the state-of-the-art method on both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z)
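The low-frequency matching idea in the LFCC entry above can be sketched as follows. This is a rough illustration under stated assumptions, not the LFCC authors' implementation: average pooling at several scales stands in for the multi-scale low-pass filters, and an InfoNCE-style objective contrasts matching teacher/student pairs within a batch. The pooling scales, temperature, and the requirement that student features are already projected to the teacher's shape are all assumptions.

```python
# Hedged sketch of contrastive low-frequency feature distillation.
import torch
import torch.nn.functional as F

def low_frequency_components(feat: torch.Tensor, scales=(2, 4, 8)):
    """feat: (B, C, H, W) with H, W >= 8. Average pooling at several scales
    acts as a crude low-pass filter; each view is flattened per sample."""
    return [F.avg_pool2d(feat, kernel_size=k, stride=k).flatten(1) for k in scales]

def contrastive_lf_loss(s_feat: torch.Tensor, t_feat: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: each student sample should match its own teacher sample.
    Assumes s_feat and t_feat have identical shapes (e.g. the student feature
    has already been projected to the teacher's channel/spatial dimensions)."""
    s_views = low_frequency_components(s_feat)
    t_views = low_frequency_components(t_feat)
    loss = s_feat.new_zeros(())
    for s, t in zip(s_views, t_views):
        s = F.normalize(s, dim=1)
        t = F.normalize(t, dim=1)
        logits = s @ t.T / temperature                      # (B, B) cosine similarities
        labels = torch.arange(s.size(0), device=s.device)   # diagonal entries are positives
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(s_views)
```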