Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation
- URL: http://arxiv.org/abs/2310.07265v1
- Date: Wed, 11 Oct 2023 07:45:37 GMT
- Title: Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation
- Authors: Xu Zheng, Yunhao Luo, Pengyuan Zhou, Lin Wang
- Abstract summary: We propose a novel CNN-to-ViT KD framework, dubbed C2VKD.
We first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations.
We then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes.
- Score: 12.177329445930276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle a new problem: how to transfer knowledge from a
pre-trained, cumbersome yet well-performing CNN-based model to learn a compact
Vision Transformer (ViT)-based model while maintaining its learning capacity?
Due to the completely different characteristics of ViTs and CNNs and the
long-standing capacity gap between teacher and student models in Knowledge
Distillation (KD), directly transferring cross-model knowledge is
non-trivial. To this end, we subtly leverage the visual and
linguistic-compatible feature characteristics of the ViT (i.e., the student)
and its capacity gap with the CNN (i.e., the teacher), and propose a novel
CNN-to-ViT KD framework, dubbed C2VKD. Importantly, as the teacher's features are
heterogeneous to those of the student, we first propose a novel
visual-linguistic feature distillation (VLFD) module that explores efficient KD
among the aligned visual and linguistic-compatible representations. Moreover,
due to the large capacity gap between the teacher and student and the
inevitable prediction errors of the teacher, we then propose a pixel-wise
decoupled distillation (PDD) module to supervise the student under the
combination of labels and teacher's predictions from the decoupled target and
non-target classes. Experiments on three semantic segmentation benchmark
datasets consistently show that the mIoU improvement of our method is more
than 200% of that of the state-of-the-art KD methods.
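As a rough illustration of the pixel-wise decoupled idea, the sketch below computes a per-pixel loss that splits the teacher's prediction into the target class and the renormalized non-target classes (a DKD-style split). This is not the authors' implementation; the function names and the alpha/beta weights are illustrative assumptions, and the combination with ground-truth labels used by PDD is omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decoupled_kd_pixel(teacher_logits, student_logits, target, alpha=1.0, beta=2.0):
    """Decoupled KD for a single pixel: KL on the target/non-target binary
    split plus KL over the renormalized non-target classes (DKD-style sketch)."""
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    # Binary distributions: mass on the target class vs. everything else.
    b_t = [p_t[target], 1.0 - p_t[target]]
    b_s = [p_s[target], 1.0 - p_s[target]]
    target_kl = sum(pt * math.log(pt / ps) for pt, ps in zip(b_t, b_s))
    # Renormalized distributions over the non-target classes only.
    nt_t = [p / b_t[1] for i, p in enumerate(p_t) if i != target]
    nt_s = [p / b_s[1] for i, p in enumerate(p_s) if i != target]
    non_target_kl = sum(pt * math.log(pt / ps) for pt, ps in zip(nt_t, nt_s))
    return alpha * target_kl + beta * non_target_kl
```

When the student matches the teacher exactly, both KL terms vanish and the loss is zero; the beta weight up-weights the non-target ("dark knowledge") term. In PDD this kind of per-pixel term would additionally be supervised by the ground-truth labels.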
Related papers
- HDKD: Hybrid Data-Efficient Knowledge Distillation Network for Medical Image Classification [0.0]
Vision Transformers (ViTs) have achieved significant advancement in computer vision tasks due to their powerful modeling capacity.
Previous approaches to Knowledge Distillation (KD) have pursued two primary paths: some focused solely on distilling the logit distribution from the CNN teacher to the ViT student.
This paper presents Hybrid Data-efficient Knowledge Distillation (HDKD) paradigm which employs a CNN teacher and a hybrid student.
arXiv Detail & Related papers (2024-07-10T10:09:12Z)
- I2CKD : Intra- and Inter-Class Knowledge Distillation for Semantic Segmentation [1.433758865948252]
This paper proposes a new knowledge distillation method tailored for image semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation (I2CKD).
The focus of this method is on capturing and transferring knowledge between the intermediate layers of the teacher (cumbersome model) and the student (compact model).
arXiv Detail & Related papers (2024-03-27T12:05:22Z)
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models [42.9076113835733]
We introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model.
Our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels.
In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits.
arXiv Detail & Related papers (2024-03-05T08:53:30Z)
- Object-Centric Diffusion for Efficient Video Editing [64.71639719352636]
We propose a novel knowledge distillation schema coined as Delta Distillation.
We demonstrate that these temporal variations can be effectively distilled due to the temporal redundancies within video frames.
As a by-product, delta distillation improves the temporal consistency of the teacher model.
arXiv Detail & Related papers (2024-01-11T08:36:15Z)
- A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation [8.110815355364947]
We propose an online knowledge distillation (KD) framework that can simultaneously learn CNN-based and ViT-based models.
Our proposed framework outperforms the state-of-the-art online distillation methods by a large margin.
arXiv Detail & Related papers (2023-07-24T07:46:06Z)
- Cross Architecture Distillation for Face Recognition [49.55061794917994]
We develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge.
Experiments on popular face benchmarks and two large-scale verification sets demonstrate the superiority of our method.
arXiv Detail & Related papers (2023-06-26T12:54:28Z)
- Exploring Inter-Channel Correlation for Diversity-preserved Knowledge Distillation [91.56643684860062]
Inter-Channel Correlation for Knowledge Distillation (ICKD) is developed.
ICKD captures the intrinsic distribution of the feature space and sufficient diversity properties of features in the teacher network.
Ours is the first knowledge-distillation-based method to boost ResNet18 beyond 72% Top-1 accuracy on ImageNet classification.
arXiv Detail & Related papers (2022-02-08T07:01:56Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when the adversarial robustness can be transferred from a teacher model to a student model in Knowledge Distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Distilling Visual Priors from Self-Supervised Learning [24.79633121345066]
Convolutional Neural Networks (CNNs) are prone to overfit small training datasets.
We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting.
arXiv Detail & Related papers (2020-08-01T13:07:18Z)
- Distilling Knowledge from Graph Convolutional Networks [146.71503336770886]
Existing knowledge distillation methods focus on convolutional neural networks (CNNs).
We propose the first dedicated approach to distilling knowledge from a pre-trained graph convolutional network (GCN) model.
We show that our method achieves the state-of-the-art knowledge distillation performance for GCN models.
arXiv Detail & Related papers (2020-03-23T18:23:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.