DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
- URL: http://arxiv.org/abs/2204.12997v2
- Date: Thu, 28 Apr 2022 14:36:21 GMT
- Title: DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
- Authors: Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, Dacheng Tao
- Abstract summary: We propose an early knowledge distillation framework, termed DearKD, to improve the data efficiency of transformers.
Our DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation.
- Score: 91.6129538027725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have been successfully applied to computer vision due to their
powerful modeling capacity with self-attention. However, the excellent
performance of transformers depends heavily on enormous numbers of training
images, so a data-efficient transformer solution is urgently needed. In this
work, we propose an early knowledge distillation framework, termed DearKD, to
improve the data efficiency of transformers. Our DearKD is a two-stage
framework that first distills the inductive biases from the early intermediate
layers of a CNN and then gives the transformer full play by training without
distillation. Further, our DearKD can be readily applied to the extreme
data-free case where no real images are available. In this case, we propose a
boundary-preserving intra-divergence loss based on DeepInversion to further
close the performance gap against the full-data counterpart. Extensive
experiments on ImageNet, partial ImageNet, the data-free setting and other
downstream tasks demonstrate the superiority of DearKD over its baselines and
state-of-the-art methods.
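To make the two-stage schedule concrete, here is a minimal PyTorch-style sketch. It assumes the student ViT returns both logits and intermediate patch tokens, that the CNN teacher exposes an early_features method for its early intermediate layers, and that a small align module (e.g. a 1x1 convolution plus pooling onto the patch grid) matches shapes; these names and the MSE matching term are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of a two-stage early-distillation schedule (assumptions noted above).
import torch
import torch.nn.functional as F

def train_two_stage(vit_student, cnn_teacher, align, loader, epochs, stage1_epochs, lr=5e-4):
    optimizer = torch.optim.AdamW(vit_student.parameters(), lr=lr)
    cnn_teacher.eval()
    for epoch in range(epochs):
        distill = epoch < stage1_epochs                # stage 1: early knowledge distillation only
        for images, labels in loader:
            logits, patch_tokens = vit_student(images)           # assumed extra output
            loss = F.cross_entropy(logits, labels)
            if distill:
                with torch.no_grad():
                    feats = cnn_teacher.early_features(images)   # early intermediate CNN features
                # project CNN feature maps onto the patch-token grid before matching
                target = align(feats).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
                loss = loss + F.mse_loss(patch_tokens, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return vit_student

In stage two (epochs beyond stage1_epochs) the distillation term is dropped entirely, letting the transformer train freely; the data-free variant described in the abstract would replace loader with images synthesized via DeepInversion and add the boundary-preserving intra-divergence loss on top.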
Related papers
- Image-Conditional Diffusion Transformer for Underwater Image Enhancement [4.555168682310286]
We propose a novel underwater image enhancement (UIE) method based on an image-conditional diffusion transformer (ICDT).
Our method takes the degraded underwater image as the conditional input and converts it into latent space where ICDT is applied.
Our largest model, ICDT-XL/2, outperforms all comparison methods, achieving state-of-the-art (SOTA) image enhancement quality.
arXiv Detail & Related papers (2024-07-07T14:34:31Z)
- Remote Sensing Change Detection With Transformers Trained from Scratch [62.96911491252686]
Existing transformer-based change detection (CD) approaches either employ a model pre-trained on the large-scale ImageNet classification dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark.
We develop an end-to-end CD approach with transformers that is trained from scratch and yet achieves state-of-the-art performance on four public benchmarks.
arXiv Detail & Related papers (2023-04-13T17:57:54Z)
- Supervised Masked Knowledge Distillation for Few-Shot Transformers [36.46755346410219]
We propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers.
Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens.
Despite its simple design, our method outperforms previous methods by a large margin and achieves a new state-of-the-art.
arXiv Detail & Related papers (2023-03-25T03:31:46Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although the network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationships between images and their divided patches; an illustrative sketch of this relation-matching idea appears after this list.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple heads and multiple tails.
arXiv Detail & Related papers (2020-12-01T09:42:46Z)
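To illustrate the fine-grained manifold distillation entry referenced above, a relation-style distillation loss that matches patch-to-patch similarity structure between teacher and student can be sketched as follows; the function name, cosine-similarity form, and MSE matching are assumptions for exposition, not that paper's exact objective.

# Illustrative patch-relation distillation loss (assumptions noted above).
import torch
import torch.nn.functional as F

def relation_loss(student_tokens, teacher_tokens):
    # Both inputs: (batch, num_patches, dim); dims may differ between the two models,
    # since only the normalized similarity (relation) matrices are compared.
    def relation(tokens):
        flat = tokens.reshape(-1, tokens.shape[-1])   # (batch * num_patches, dim)
        flat = F.normalize(flat, dim=-1)
        return flat @ flat.t()                        # cosine-similarity relation map
    return F.mse_loss(relation(student_tokens), relation(teacher_tokens))

A training loop would simply add this term, suitably weighted, to the student's task loss, optionally computing it at several layers.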