Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation
- URL: http://arxiv.org/abs/2310.07265v1
- Date: Wed, 11 Oct 2023 07:45:37 GMT
- Title: Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation
- Authors: Xu Zheng, Yunhao Luo, Pengyuan Zhou, Lin Wang
- Abstract summary: We propose a novel CNN-to-ViT KD framework, dubbed C2VKD.
We first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations.
We then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes.
- Score: 12.177329445930276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle a new problem: how to transfer knowledge from a
pre-trained, cumbersome yet well-performing CNN-based model to learn a compact
Vision Transformer (ViT)-based model while maintaining its learning capacity?
Due to the completely different characteristics of ViTs and CNNs and the
long-standing capacity gap between teacher and student models in Knowledge
Distillation (KD), directly transferring cross-model knowledge is
non-trivial. To this end, we subtly leverage the visual and
linguistic-compatible feature characteristics of the ViT (i.e., the student)
and its capacity gap with the CNN (i.e., the teacher), and propose a novel
CNN-to-ViT KD framework, dubbed C2VKD. Importantly, as the teacher's features are
heterogeneous to those of the student, we first propose a novel
visual-linguistic feature distillation (VLFD) module that explores efficient KD
among the aligned visual and linguistic-compatible representations. Moreover,
due to the large capacity gap between the teacher and student and the
inevitable prediction errors of the teacher, we then propose a pixel-wise
decoupled distillation (PDD) module to supervise the student under the
combination of labels and teacher's predictions from the decoupled target and
non-target classes. Experiments on three semantic segmentation benchmark
datasets consistently show that the mIoU improvement of our method is more
than 200% of that of the state-of-the-art KD methods.
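As a rough illustration of the pixel-wise decoupled idea, the sketch below computes a per-pixel loss that splits the teacher's prediction into the target class and the renormalized non-target classes (a DKD-style split). This is not the authors' implementation; the function names and the alpha/beta weights are illustrative assumptions, and the combination with ground-truth labels used by PDD is omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decoupled_kd_pixel(teacher_logits, student_logits, target, alpha=1.0, beta=2.0):
    """Decoupled KD for a single pixel: KL on the target/non-target binary
    split plus KL over the renormalized non-target classes (DKD-style sketch)."""
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    # Binary distributions: mass on the target class vs. everything else.
    b_t = [p_t[target], 1.0 - p_t[target]]
    b_s = [p_s[target], 1.0 - p_s[target]]
    target_kl = sum(pt * math.log(pt / ps) for pt, ps in zip(b_t, b_s))
    # Renormalized distributions over the non-target classes only.
    nt_t = [p / b_t[1] for i, p in enumerate(p_t) if i != target]
    nt_s = [p / b_s[1] for i, p in enumerate(p_s) if i != target]
    non_target_kl = sum(pt * math.log(pt / ps) for pt, ps in zip(nt_t, nt_s))
    return alpha * target_kl + beta * non_target_kl
```

When the student matches the teacher exactly, both KL terms vanish and the loss is zero; the beta weight up-weights the non-target ("dark knowledge") term. In PDD this kind of per-pixel term would additionally be supervised by the ground-truth labels.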
Related papers
- HDKD: Hybrid Data-Efficient Knowledge Distillation Network for Medical Image Classification [0.0]
Vision Transformers (ViTs) have achieved significant advancement in computer vision tasks due to their powerful modeling capacity.
Previous approaches to Knowledge Distillation (KD) have pursued two primary paths: some focused solely on distilling the logit distribution from the CNN teacher to the ViT student.
This paper presents Hybrid Data-efficient Knowledge Distillation (HDKD) paradigm which employs a CNN teacher and a hybrid student.
arXiv Detail & Related papers (2024-07-10T10:09:12Z)
- I2CKD : Intra- and Inter-Class Knowledge Distillation for Semantic Segmentation [1.433758865948252]
This paper proposes a new knowledge distillation method tailored for image semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation (I2CKD).
The focus of this method is on capturing and transferring knowledge between the intermediate layers of the teacher (cumbersome model) and the student (compact model).
arXiv Detail & Related papers (2024-03-27T12:05:22Z)
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models [42.9076113835733]
We introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model.
Our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels.
In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits.
arXiv Detail & Related papers (2024-03-05T08:53:30Z)
- Object-Centric Diffusion for Efficient Video Editing [64.71639719352636]
We propose a novel knowledge distillation schema coined as Delta Distillation.
We demonstrate that these temporal variations can be effectively distilled due to the temporal redundancies within video frames.
As a by-product, delta distillation improves the temporal consistency of the teacher model.
arXiv Detail & Related papers (2024-01-11T08:36:15Z)
- A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation [8.110815355364947]
We propose an online knowledge distillation (KD) framework that can simultaneously learn CNN-based and ViT-based models.
Our proposed framework outperforms the state-of-the-art online distillation methods by a large margin.
arXiv Detail & Related papers (2023-07-24T07:46:06Z)
- Cross Architecture Distillation for Face Recognition [49.55061794917994]
We develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge.
Experiments on popular face benchmarks and two large-scale verification sets demonstrate the superiority of our method.
arXiv Detail & Related papers (2023-06-26T12:54:28Z)
- Exploring Inter-Channel Correlation for Diversity-preserved Knowledge Distillation [91.56643684860062]
Inter-Channel Correlation for Knowledge Distillation (ICKD) is developed.
ICKD captures the intrinsic distribution of the feature space and sufficient diversity properties of features in the teacher network.
Ours is the first knowledge-distillation-based method to boost ResNet18 beyond 72% Top-1 accuracy on ImageNet classification.
arXiv Detail & Related papers (2022-02-08T07:01:56Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when the adversarial robustness can be transferred from a teacher model to a student model in Knowledge Distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Distilling Visual Priors from Self-Supervised Learning [24.79633121345066]
Convolutional Neural Networks (CNNs) are prone to overfit small training datasets.
We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting.
arXiv Detail & Related papers (2020-08-01T13:07:18Z)
- Distilling Knowledge from Graph Convolutional Networks [146.71503336770886]
Existing knowledge distillation methods focus on convolutional neural networks (CNNs).
We propose the first dedicated approach to distilling knowledge from a pre-trained graph convolutional network (GCN) model.
We show that our method achieves the state-of-the-art knowledge distillation performance for GCN models.
arXiv Detail & Related papers (2020-03-23T18:23:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.