ViTKD: Practical Guidelines for ViT feature knowledge distillation
- URL: http://arxiv.org/abs/2209.02432v1
- Date: Tue, 6 Sep 2022 11:52:46 GMT
- Title: ViTKD: Practical Guidelines for ViT feature knowledge distillation
- Authors: Zhendong Yang, Zhe Li, Ailing Zeng, Zexian Li, Chun Yuan, Yu Li
- Abstract summary: Vision Transformer (ViT) has achieved great success on many computer vision tasks.
We propose our feature-based method ViTKD, which brings consistent and considerable improvement to the student.
On ImageNet-1k, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%.
- Score: 23.8103504246977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is
extensively studied as a way to boost the performance of a small model.
Recently, Vision Transformer (ViT) has achieved great success on many computer
vision tasks, and KD for ViT is also desired. However, while output
logit-based KD transfers directly, feature-based KD methods designed for
CNNs cannot be applied to ViT due to the huge structural gap. In this
paper, we explore feature-based distillation for ViT. Based on the nature of feature maps in
ViT, we design a series of controlled experiments and derive three practical
guidelines for ViT's feature distillation. Some of our findings are even
opposite to the practices in the CNN era. Based on the three guidelines, we
propose our feature-based method ViTKD, which brings consistent and considerable
improvement to the student. On ImageNet-1k, we boost DeiT-Tiny from 74.42% to
76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%.
Moreover, ViTKD and the logit-based KD method are complementary and can be
applied together directly. This combination can further improve the performance
of the student. Specifically, the student DeiT-Tiny, Small, and Base achieve
77.78%, 83.59%, and 85.41%, respectively. The code is available at
https://github.com/yzd-v/cls_KD.
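For intuition, the recipe the abstract describes can be sketched as a training loss with two terms: an MSE on projected ViT token features plus the standard temperature-scaled logit KD. Below is a minimal PyTorch sketch under assumed choices (a single linear projection, one distilled block, unit loss weights); it is not the exact ViTKD design, which follows the paper's three guidelines.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAndLogitKD(nn.Module):
    """Sketch: feature distillation on ViT token features plus logit KD."""

    def __init__(self, dim_s, dim_t, tau=4.0, alpha=1.0, beta=1.0):
        super().__init__()
        # Assumption: a linear layer aligns student width to teacher width.
        self.proj = nn.Linear(dim_s, dim_t)
        self.tau, self.alpha, self.beta = tau, alpha, beta

    def forward(self, feat_s, feat_t, logit_s, logit_t):
        # feat_*: (B, N, C) token features from a chosen transformer block.
        feat_loss = F.mse_loss(self.proj(feat_s), feat_t.detach())
        # Classic logit KD: KL divergence between softened distributions.
        kd_loss = F.kl_div(
            F.log_softmax(logit_s / self.tau, dim=-1),
            F.softmax(logit_t.detach() / self.tau, dim=-1),
            reduction="batchmean",
        ) * self.tau ** 2
        return self.alpha * feat_loss + self.beta * kd_loss
```

Because the two terms act on different outputs (intermediate features vs. final logits), they can be summed directly, which matches the abstract's observation that ViTKD and logit-based KD are complementary.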
Related papers
- DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets [30.178427266135756]
Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks.
ViT requires a large amount of data for pre-training.
We introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets.
arXiv Detail & Related papers (2024-04-03T17:58:21Z)
- TVT: Training-Free Vision Transformer Search on Tiny Datasets [32.1204216324339]
Training-free Vision Transformer (ViT) architecture search is presented to find a better ViT with zero-cost proxies (a generic proxy sketch follows this entry).
Our TVT searches for the best ViT for distilling with ConvNet teachers via our teacher-aware metric and student-capability metric.
arXiv Detail & Related papers (2023-11-24T08:24:31Z)
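To make the zero-cost-proxy idea concrete, here is one generic example (gradient norm at initialization) in PyTorch. It only illustrates the concept; TVT's actual teacher-aware and student-capability metrics are defined in the paper and differ from this.

```python
import torch
import torch.nn.functional as F

def grad_norm_proxy(model, x, y):
    """Generic zero-cost proxy: score an untrained network by the total
    gradient norm of a single loss evaluation; a higher score is taken
    to indicate a more trainable architecture."""
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return sum(p.grad.norm().item() for p in model.parameters()
               if p.grad is not None)
```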
- ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights [61.36309876889977]
ViT-Lens enables efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.
In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art.
We will release the results of ViT-Lens on more modalities in the near future.
arXiv Detail & Related papers (2023-08-20T07:26:51Z)
- Convolutional Embedding Makes Hierarchical Vision Transformer Stronger [16.72943631060293]
Vision Transformers (ViTs) have recently dominated a range of computer vision tasks, yet they suffer from low training-data efficiency and weak local semantic representation without an appropriate inductive bias.
CNNs inherently capture region-aware semantics, inspiring researchers to reintroduce convolutions into ViT architectures to supply the desired inductive bias (a generic sketch follows this entry).
In this paper, we explore how the macro architecture of hybrid CNNs/ViTs enhances the performance of hierarchical ViTs.
arXiv Detail & Related papers (2022-07-27T06:36:36Z)
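As an illustration of the hybrid CNN/ViT idea, here is a common convolutional patch-embedding stem that injects CNN-style inductive bias into a ViT. This is a generic sketch, not necessarily the macro design the paper studies.

```python
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Sketch: a stack of stride-2 convolutions replaces the usual single
    16x16 patchify convolution, giving the ViT a CNN-style local prior."""

    def __init__(self, in_ch=3, dim=384):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 4, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim // 4), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim // 2), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),  # total stride 16
        )

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.stem(x)                     # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, dim) token sequence
```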
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Improving Vision Transformers for Incremental Learning [17.276384689286168]
This paper studies using Vision Transformers (ViT) in class incremental learning.
ViT converges very slowly when the number of classes is small.
ViT also exhibits more bias toward new classes than CNN-based models.
arXiv Detail & Related papers (2021-12-12T00:12:33Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs (a generic sketch follows this entry).
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
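The token-slimming idea can be illustrated with a generic soft-aggregation layer that shortens the token sequence. This is an assumed form for illustration, not SiT's exact TSM.

```python
import torch
import torch.nn as nn

class TokenSlimming(nn.Module):
    """Sketch: aggregate N tokens into M < N tokens via a learned soft
    assignment, so later transformer blocks attend over fewer tokens."""

    def __init__(self, dim, num_out_tokens):
        super().__init__()
        self.score = nn.Linear(dim, num_out_tokens)

    def forward(self, x):                 # x: (B, N, C)
        # Soft assignment of each input token to each output slot,
        # normalized over the input tokens.
        a = self.score(x).softmax(dim=1)  # (B, N, M)
        return a.transpose(1, 2) @ x      # (B, M, C)
```

Because self-attention cost is quadratic in sequence length, running subsequent blocks on M tokens instead of N cuts attention cost by roughly a factor of (N/M)^2.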
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy (a sketch follows this entry).
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
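A minimal sketch of input-gradient alignment, assuming the penalty is a squared distance between the teacher's and student's input gradients; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def kdiga_step_loss(student, teacher, x, y, lam=1.0):
    """Sketch: student CE loss plus a penalty aligning the student's
    input gradient with the (frozen) teacher's input gradient."""
    x = x.clone().requires_grad_(True)
    # Teacher's input gradient; no higher-order graph is needed here.
    g_t = torch.autograd.grad(F.cross_entropy(teacher(x), y), x)[0]
    # Student loss and input gradient; keep the graph so the alignment
    # penalty is differentiable with respect to the student's weights.
    ce = F.cross_entropy(student(x), y)
    g_s = torch.autograd.grad(ce, x, create_graph=True)[0]
    return ce + lam * F.mse_loss(g_s, g_t.detach())
```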
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervision provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
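For reference, DINO's label-free objective can be sketched as a cross-entropy from a centered, sharpened teacher distribution to the student's prediction. This simplified version assumes one view per network and omits multi-crop and the EMA teacher update.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Simplified DINO objective: cross-entropy between a centered,
    sharpened teacher distribution and the student's softened prediction."""
    # Teacher targets: center, then sharpen with a low temperature.
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    # Student prediction at a higher temperature.
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()
```

In DINO, the centering term (a running mean of teacher outputs) and the low teacher temperature work together to prevent representation collapse.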