Supervised Masked Knowledge Distillation for Few-Shot Transformers
- URL: http://arxiv.org/abs/2303.15466v2
- Date: Wed, 29 Mar 2023 01:59:30 GMT
- Title: Supervised Masked Knowledge Distillation for Few-Shot Transformers
- Authors: Han Lin, Guangxing Han, Jiawei Ma, Shiyuan Huang, Xudong Lin, Shih-Fu
Chang
- Abstract summary: We propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers.
Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens.
Our method, despite its simple design, outperforms previous methods by a large margin and achieves a new state-of-the-art.
- Score: 36.46755346410219
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have emerged to achieve impressive
performance on many data-abundant computer vision tasks by capturing long-range
dependencies among local features. However, under few-shot learning (FSL)
settings on small datasets with only a few labeled examples, ViTs tend to
overfit and suffer severe performance degradation due to their lack of CNN-like
inductive biases. Previous works in FSL avoid this problem either with the help
of self-supervised auxiliary losses or through the dexterous use of label
information under supervised settings, but the gap between self-supervised and
supervised few-shot Transformers remains unfilled. Inspired by recent advances
in self-supervised knowledge distillation and masked image modeling (MIM), we
propose a novel Supervised Masked Knowledge Distillation model (SMKD) for
few-shot Transformers which incorporates label information into
self-distillation frameworks. Compared with previous self-supervised methods,
we allow intra-class knowledge distillation on both class and patch tokens, and
introduce the challenging task of masked patch token reconstruction across
intra-class images. Experimental results on four few-shot classification
benchmark datasets show that our method, despite its simple design, outperforms
previous methods by a large margin and achieves a new state-of-the-art. Detailed
ablation studies confirm the effectiveness of each component of our model. Code
for this paper is available here: https://github.com/HL-hanlin/SMKD.
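For intuition, below is a minimal sketch of the supervised class-token distillation idea: student predictions for one image are matched against teacher targets from a different image of the same class. The temperatures, the projection to K prototypes, and the omission of centering and momentum-teacher updates are simplifying assumptions, not the authors' implementation; their actual code is in the repository linked above.
```python
import torch
import torch.nn.functional as F

def intra_class_cls_distillation(student_cls, teacher_cls, labels,
                                 student_temp=0.1, teacher_temp=0.04):
    """Illustrative intra-class [CLS]-token distillation loss (DINO-style).

    student_cls, teacher_cls: (B, K) outputs of the student / momentum-teacher
    projection heads over K prototypes for the same batch of images.
    labels: (B,) class ids. For every ordered pair (i, j) with the same label
    and i != j, the student's prediction for image i is pulled toward the
    teacher's sharpened target for image j (intra-class distillation).
    """
    log_p_s = F.log_softmax(student_cls / student_temp, dim=-1)      # (B, K)
    p_t = F.softmax(teacher_cls / teacher_temp, dim=-1).detach()     # (B, K)

    same_class = labels[:, None] == labels[None, :]                  # (B, B)
    same_class &= ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    if not same_class.any():
        return log_p_s.new_zeros(())

    i, j = same_class.nonzero(as_tuple=True)
    # Cross-entropy between teacher target of image j and student output of image i.
    return -(p_t[j] * log_p_s[i]).sum(dim=-1).mean()
```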
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
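As a rough, hedged illustration of that idea (not the paper's actual module), cross-attention between support and query patch tokens might look as follows; the layer choice, head count, and residual connections are assumptions.
```python
import torch
import torch.nn as nn

class MutualPatchAttention(nn.Module):
    """Illustrative cross-attention between support and query patch tokens."""

    def __init__(self, dim: int, heads: int = 6):
        super().__init__()
        self.q_to_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, support_patches, query_patches):
        # support_patches: (B, Ns, D), query_patches: (B, Nq, D)
        q_enriched, _ = self.q_to_s(query_patches, support_patches, support_patches)
        s_enriched, _ = self.s_to_q(support_patches, query_patches, query_patches)
        # Residual connections keep the original patch features.
        return support_patches + s_enriched, query_patches + q_enriched
```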
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
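The LoRA step can be illustrated with a minimal sketch (not the authors' code): a frozen linear layer, copied from the teacher, is augmented with a trainable low-rank update. The rank, scaling, and choice of layer below are assumptions.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # teacher-copied weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

# Example: wrap the qkv projection of a (hypothetical) student attention block.
qkv = nn.Linear(384, 3 * 384)
adapted_qkv = LoRALinear(qkv, r=4)
out = adapted_qkv(torch.randn(2, 197, 384))   # (batch, tokens, 3 * dim)
```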
arXiv Detail & Related papers (2024-04-14T18:57:38Z)
- Attention-Guided Masked Autoencoders For Learning Image Representations [16.257915216763692]
Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks.
We propose to inform the reconstruction process through an attention-guided loss function.
Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE.
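A hedged sketch of what an attention-guided reconstruction loss could look like: per-patch error weighted by an attention map so that salient patches contribute more. The weighting scheme and tensor names are assumptions, not the paper's exact formulation.
```python
import torch

def attention_guided_recon_loss(pred, target, attn, mask):
    """Per-patch MSE weighted by attention scores.

    pred, target: (B, N, P) reconstructed / ground-truth patch pixels.
    attn:         (B, N) attention score per patch (e.g. [CLS] attention),
                  used here as a relevance weight (assumption).
    mask:         (B, N) bool, True where the patch was masked.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)               # (B, N)
    weights = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return (weights * per_patch * mask).sum() / mask.sum().clamp_min(1)
```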
arXiv Detail & Related papers (2024-02-23T08:11:25Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
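The Gumbel-Softmax link between a learned mask generator and the masked-modeling objective can be sketched as follows; this is an assumption-laden illustration (the top-k selection and straight-through trick are not necessarily what AutoMAE does).
```python
import torch
import torch.nn.functional as F

def sample_patch_mask(mask_logits, mask_ratio=0.75, tau=1.0):
    """Differentiably pick which patches to mask from per-patch logits.

    mask_logits: (B, N) scores from a (hypothetical) learned mask generator.
    Returns a (B, N) mask that is hard (0/1) in the forward pass but passes
    gradients to the generator through the Gumbel-Softmax relaxation.
    """
    # Relaxed distribution over patches.
    soft = F.gumbel_softmax(mask_logits, tau=tau, hard=False, dim=-1)   # (B, N)

    # Keep the top mask_ratio fraction of patches as "masked".
    k = int(mask_logits.size(1) * mask_ratio)
    topk = soft.topk(k, dim=-1).indices
    hard = torch.zeros_like(soft).scatter_(-1, topk, 1.0)

    # Straight-through estimator: hard values forward, soft gradients backward.
    return hard + soft - soft.detach()
```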
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning [10.29251906347605]
We propose a novel mask-guided vision transformer (MG-ViT) to achieve effective and efficient few-shot learning with the vision transformer (ViT) model.
The MG-ViT model significantly improves the performance when compared with general fine-tuning based ViT models.
arXiv Detail & Related papers (2022-05-20T07:25:33Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks using ViTs, and is the first to achieve higher performance than CNN-based state-of-the-art methods.
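A minimal sketch of location-specific patch supervision in that spirit: a teacher classifier produces a soft target per patch token and the student's patch tokens are trained against those targets. The temperature and head design are assumptions, not SUN's exact formulation.
```python
import torch
import torch.nn.functional as F

def patch_level_supervision_loss(student_patch_logits, teacher_patch_logits, temp=1.0):
    """Location-specific supervision: one soft target per patch token.

    student_patch_logits, teacher_patch_logits: (B, N, K) classifier outputs
    for each of the N patch tokens; the teacher is a ViT pretrained on the
    few-shot learning dataset, per the summary above.
    """
    targets = F.softmax(teacher_patch_logits / temp, dim=-1).detach()   # (B, N, K)
    log_probs = F.log_softmax(student_patch_logits / temp, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```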
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
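A hedged sketch of patch-level relational (manifold) distillation in that spirit: instead of matching raw features, match the pairwise similarity structure among patch embeddings of teacher and student. The cosine-similarity relation and MSE matching below are assumptions.
```python
import torch
import torch.nn.functional as F

def manifold_distillation_loss(student_patches, teacher_patches):
    """Match the patch-to-patch similarity structure of teacher and student.

    student_patches: (B, N, Ds), teacher_patches: (B, N, Dt). Feature widths
    may differ; only the (B*N) x (B*N) relation matrices are compared.
    """
    s = F.normalize(student_patches.flatten(0, 1), dim=-1)   # (B*N, Ds)
    t = F.normalize(teacher_patches.flatten(0, 1), dim=-1)   # (B*N, Dt)
    rel_s = s @ s.T                                           # (B*N, B*N)
    rel_t = t @ t.T
    return F.mse_loss(rel_s, rel_t)
```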
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.