CORSD: Class-Oriented Relational Self Distillation
- URL: http://arxiv.org/abs/2305.00918v1
- Date: Fri, 28 Apr 2023 16:00:31 GMT
- Title: CORSD: Class-Oriented Relational Self Distillation
- Authors: Muzhou Yu, Sia Huat Tan, Kailu Wu, Runpei Dong, Linfeng Zhang,
Kaisheng Ma
- Abstract summary: Knowledge distillation is an effective model compression method, but existing approaches have some limitations.
We propose a novel training framework named Class-Oriented Relational Self Distillation (CORSD) to address these limitations.
- Score: 16.11986532440837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is an effective model compression method, but
existing approaches have some limitations: (1) feature-based distillation
methods focus on distilling feature maps but do not transfer the relations
among data examples; (2) relational distillation methods are either limited to
handcrafted functions for relation extraction, such as the L2 norm, or weak at
modeling inter- and intra-class relations. Besides, the feature divergence of
heterogeneous teacher-student architectures may lead to inaccurate relational
knowledge transfer. In this work, we propose a novel training framework named
Class-Oriented Relational Self Distillation (CORSD) to address these
limitations. Trainable relation networks are designed to extract relations from
the structured data input, and they enable the whole model to better classify
samples by transferring relational knowledge from the deepest layer of the
model to shallow layers. In addition, auxiliary classifiers are proposed to
make the relation networks capture class-oriented relations that benefit the
classification task. Experiments demonstrate that CORSD achieves remarkable
improvements: compared to the baseline, averaged accuracy boosts of 3.8%, 1.5%,
and 4.5% are observed on CIFAR100, ImageNet, and CUB-200-2011, respectively.
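To make the described training flow concrete, here is a minimal PyTorch-style sketch of a self-distillation setup with trainable relation networks and auxiliary classifiers at the shallow stages, where the deepest stage's relations supervise the shallower ones. All module names, shapes, relation forms, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the training flow described in the abstract.
# All names, shapes, and loss weights are illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationNet(nn.Module):
    """Trainable relation extractor: maps a batch of features to a
    pairwise relation matrix (assumed form; the paper's design may differ)."""
    def __init__(self, in_dim, rel_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, rel_dim), nn.ReLU(),
                                  nn.Linear(rel_dim, rel_dim))

    def forward(self, feats):                 # feats: (B, in_dim)
        z = F.normalize(self.proj(feats), dim=1)
        return z @ z.t()                      # (B, B) relation matrix

class CORSDSketch(nn.Module):
    def __init__(self, stages, dims, num_classes):
        super().__init__()
        self.stages = nn.ModuleList(stages)   # backbone split into blocks
        self.aux_heads = nn.ModuleList(nn.Linear(d, num_classes) for d in dims)
        self.rel_nets = nn.ModuleList(RelationNet(d) for d in dims)

    def forward(self, x):
        feats = []
        for stage in self.stages:             # each stage outputs a (B, C, H, W) map
            x = stage(x)
            feats.append(F.adaptive_avg_pool2d(x, 1).flatten(1))
        logits = [head(f) for head, f in zip(self.aux_heads, feats)]
        relations = [rn(f) for rn, f in zip(self.rel_nets, feats)]
        return logits, relations

def corsd_loss(logits, relations, targets, alpha=1.0):
    # Class-oriented supervision on every (auxiliary) classifier ...
    ce = sum(F.cross_entropy(l, targets) for l in logits)
    # ... plus self-distillation of the deepest layer's relations to shallow layers.
    deepest = relations[-1].detach()
    rel = sum(F.mse_loss(r, deepest) for r in relations[:-1])
    return ce + alpha * rel
```

In this sketch, training would call `corsd_loss(*model(images), labels)` to combine the class-oriented cross-entropy terms with the relation-matching term; at inference, only the backbone with its final classifier would typically be kept.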
Related papers
- TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant [52.0297393822012]
We introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students.
Within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions.
Our proposed method is evaluated across various homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and spatial KDs.
arXiv Detail & Related papers (2024-10-16T08:02:49Z) - Relational Representation Distillation [6.24302896438145]
We introduce Relational Representation Distillation (RRD) to explore and reinforce the relationships between teacher and student models.
Inspired by self-supervised learning principles, it uses a relaxed contrastive loss that focuses on similarity rather than exact replication (a minimal sketch of such a loss appears after this list).
Our approach demonstrates superior performance on CIFAR-100 and ImageNet ILSVRC-2012 and sometimes even outperforms the teacher network when combined with KD.
arXiv Detail & Related papers (2024-07-16T14:56:13Z) - One-for-All: Bridge the Gap Between Heterogeneous Architectures in
Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z) - EmbedDistill: A Geometric Knowledge Distillation for Information
Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Continual Contrastive Finetuning Improves Low-Resource Relation
Extraction [34.76128090845668]
Relation extraction has been particularly challenging in low-resource scenarios and domains.
Recent literature has tackled low-resource RE by self-supervised learning.
We propose to pretrain and finetune the RE model using consistent objectives of contrastive learning.
arXiv Detail & Related papers (2022-12-21T07:30:22Z) - Directed Acyclic Graph Factorization Machines for CTR Prediction via
Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z) - Weakly Supervised Semantic Segmentation via Alternative Self-Dual
Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model.
We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z) - Complementary Relation Contrastive Distillation [13.944372633594085]
We propose a novel knowledge distillation method, namely Complementary Relation Contrastive Distillation (CRCD).
We estimate the mutual relation in an anchor-based way and distill the anchor-student relation under the supervision of its corresponding anchor-teacher relation.
Experiments on different benchmarks demonstrate the effectiveness of our proposed CRCD.
arXiv Detail & Related papers (2021-03-29T02:43:03Z) - Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Results show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z) - Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes a lower bound on the mutual information between the teacher and the student networks (see the sketch at the end of this list).
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
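As a concrete illustration of the relaxed contrastive loss mentioned in the Relational Representation Distillation entry above, here is a minimal sketch in which the student matches the teacher's pairwise similarity distribution rather than replicating its features exactly. The temperatures and the KL formulation are assumptions for illustration, not the RRD paper's exact objective.

```python
# Hypothetical sketch of a relaxed contrastive / relational distillation loss:
# the student matches the teacher's pairwise similarity distribution instead of
# copying features exactly. Temperatures and the KL form are assumptions.
import torch
import torch.nn.functional as F

def relational_distillation_loss(f_student, f_teacher, t_s=0.1, t_t=0.05):
    s = F.normalize(f_student, dim=1)
    t = F.normalize(f_teacher, dim=1)
    B = s.size(0)
    mask = ~torch.eye(B, dtype=torch.bool, device=s.device)  # drop self-similarities
    sim_s = (s @ s.t())[mask].view(B, B - 1)
    sim_t = (t @ t.t())[mask].view(B, B - 1)
    # Soften the teacher distribution ("relaxed" matching rather than exact copy).
    p_t = F.softmax(sim_t / t_t, dim=1)
    log_p_s = F.log_softmax(sim_s / t_s, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")
```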
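For the Wasserstein Contrastive Representation Distillation entry, the "contrastive objective that maximizes a lower bound on mutual information" can be illustrated with a generic InfoNCE-style critic. This is only a sketch of that family of objectives; WCoRD's actual dual-form Wasserstein critic (with its Lipschitz constraint) differs in detail, and the critic, temperature, and bound below are assumptions.

```python
# Generic InfoNCE-style sketch of a contrastive objective that maximizes a
# lower bound on the mutual information between teacher and student features.
# The bilinear critic and temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearCritic(nn.Module):
    """Scores teacher/student feature pairs; high scores for matched pairs."""
    def __init__(self, dim_t, dim_s):
        super().__init__()
        self.W = nn.Linear(dim_s, dim_t, bias=False)

    def forward(self, f_t, f_s):              # (B, dim_t), (B, dim_s)
        return self.W(f_s) @ f_t.t()          # (B, B) score matrix

def mi_lower_bound_loss(critic, f_t, f_s, tau=0.07):
    scores = critic(f_t, f_s) / tau           # row i: student i vs. all teachers
    labels = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy on the matched (diagonal) pairs is the negative InfoNCE
    # bound on mutual information between teacher and student representations.
    return F.cross_entropy(scores, labels)
```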