Self-Promoted Supervision for Few-Shot Transformer
- URL: http://arxiv.org/abs/2203.07057v1
- Date: Mon, 14 Mar 2022 12:53:27 GMT
- Title: Self-Promoted Supervision for Few-Shot Transformer
- Authors: Bowen Dong, Pan Zhou, Shuicheng Yan, Wangmeng Zuo
- Abstract summary: Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to outperform the CNN state of the art.
- Score: 178.52948452353834
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The few-shot learning ability of vision transformers (ViTs) is rarely
investigated though heavily desired. In this work, we empirically find that
with the same few-shot learning frameworks, e.g., Meta-Baseline, replacing the
widely used CNN feature extractor with a ViT model often severely impairs
few-shot classification performance. Moreover, our empirical study shows that
in the absence of inductive bias, ViTs often learn the dependencies among input
tokens slowly under the few-shot learning regime, where only a few labeled
training examples are available; this largely contributes to the above performance
degradation. To alleviate this issue, for the first time, we propose a simple
yet effective few-shot training framework for ViTs, namely Self-promoted
sUpervisioN (SUN). Specifically, besides the conventional global supervision
for global semantic learning, SUN further pretrains the ViT on the few-shot
learning dataset and then uses it to generate individual location-specific
supervision for guiding each patch token. This location-specific supervision
tells the ViT which patch tokens are similar or dissimilar and thus accelerates
token dependency learning. Moreover, it models the local semantics in each
patch token to improve the object grounding and recognition capability, which
helps learn generalizable patterns. To improve the quality of the location-specific
supervision, we further propose two techniques: 1) background patch filtration,
which filters out background patches and assigns them to an extra background
class; and 2) spatial-consistent augmentation, which introduces sufficient
diversity into data augmentation while preserving the accuracy of the generated
local supervision. Experimental results show that SUN with ViTs significantly
surpasses other ViT-based few-shot learning frameworks and is the first to
outperform the CNN state of the art.
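As a concrete illustration of the pipeline described in the abstract, the sketch below shows one way the location-specific supervision and the background patch filtration could be realized in PyTorch. The teacher/student interfaces, the `patch_classifier` and `student_patch_head` modules, and the confidence threshold are assumptions made for this sketch, not details taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def patch_pseudo_labels(teacher_tokens, patch_classifier, bg_threshold=0.5):
    """Generate location-specific supervision for each patch token.

    teacher_tokens: (B, N, D) patch-token features from a teacher ViT that was
        pretrained on the few-shot learning dataset (hypothetical interface).
    patch_classifier: linear head mapping D -> C base classes (hypothetical).
    Returns soft pseudo-labels of shape (B, N, C + 1); the extra class is the
    background class used by background patch filtration.
    """
    logits = patch_classifier(teacher_tokens)          # (B, N, C)
    probs = F.softmax(logits, dim=-1)
    conf, _ = probs.max(dim=-1)                        # per-patch confidence

    # Background patch filtration (sketch): low-confidence patches are treated
    # as background and assigned to the extra background class.
    bg = (conf < bg_threshold).float().unsqueeze(-1)   # (B, N, 1)
    soft_labels = torch.cat([probs * (1.0 - bg), bg], dim=-1)
    return soft_labels

def sun_patch_loss(student_tokens, student_patch_head, soft_labels):
    """Local (per-patch) loss; student_patch_head maps D -> C + 1 classes."""
    student_logits = student_patch_head(student_tokens)   # (B, N, C + 1)
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(soft_labels * log_probs).sum(dim=-1).mean()
```

This per-patch loss would be added to the conventional global classification loss on the image-level prediction.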
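Spatial-consistent augmentation is described only at a high level in the abstract; one natural reading is that geometric transforms are applied identically to the image and to its patch-level pseudo-label grid, while photometric transforms touch the image alone. A minimal sketch under that assumption (the flip, the brightness jitter, and the 14x14 grid are placeholders):

```python
import random
import torch
import torchvision.transforms.functional as TF

def spatial_consistent_augment(image, patch_labels, grid=14):
    """Apply the same geometric transform to the image and to its patch-level
    pseudo-label map, so local supervision stays aligned with patch positions.

    image:        (3, H, W) tensor
    patch_labels: (grid, grid, C) soft pseudo-labels arranged on the patch grid
    """
    # Random horizontal flip: flip both the image and the label grid columns.
    if random.random() < 0.5:
        image = TF.hflip(image)
        patch_labels = torch.flip(patch_labels, dims=[1])

    # Photometric jitter only touches the image, so label accuracy is preserved.
    image = TF.adjust_brightness(image, 1.0 + 0.2 * (random.random() - 0.5))

    return image, patch_labels
```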
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
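A rough sketch of the mutual-attention idea summarized above, assuming support and query images are first split into patch tokens by a shared ViT backbone; the module, dimensions, and pooling strategy are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class IntraTaskMutualAttention(nn.Module):
    """Illustrative cross-attention between support and query patch tokens."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.query_to_support = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.support_to_query = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, support_tokens, query_tokens):
        # support_tokens: (B, Ns, D) patch tokens from the support images
        # query_tokens:   (B, Nq, D) patch tokens from the query images
        q_enriched, _ = self.query_to_support(query_tokens, support_tokens, support_tokens)
        s_enriched, _ = self.support_to_query(support_tokens, query_tokens, query_tokens)
        return s_enriched, q_enriched

# Usage: the mutually attended features can be pooled and compared with a
# distance-based few-shot classifier (e.g., cosine similarity to prototypes).
mutual = IntraTaskMutualAttention(dim=384)
s = torch.randn(2, 196, 384)   # support patch tokens
q = torch.randn(2, 196, 384)   # query patch tokens
s_out, q_out = mutual(s, q)
```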
- Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis [38.074487843137064]
This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and conventional neural networks (ConvNets) for detecting facial deepfake images and videos.
It examines their potential for improved generalization and explainability, especially with limited training data.
By leveraging SSL ViTs for deepfake detection with modest data and partial fine-tuning, we find adaptability comparable to that of supervised pre-training, along with explainability via the attention mechanism.
arXiv Detail & Related papers (2024-05-01T07:16:49Z)
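The partial fine-tuning mentioned above usually means freezing most of the pretrained backbone and training only the last block(s) and a small task head. A hedged sketch using timm; the checkpoint name, the "last block only" choice, and the parameter-name prefixes are assumptions, not details from the paper:

```python
import timm

# Any pretrained ViT checkpoint would do; this particular one is just an example.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)

# Freeze everything, then unfreeze the final transformer block, the final norm,
# and the binary real/fake classification head (names follow timm's ViT layout).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("blocks.11", "norm", "head"))

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"training {len(trainable)} parameter tensors out of "
      f"{sum(1 for _ in model.parameters())}")
```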
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in few-shot learning tasks.
The key novelties are the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
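APT and DRA are only named in this summary. Generically, prefix tuning prepends learnable key/value tokens to each attention layer, and an adapter adds a small residual bottleneck, with the pretrained backbone kept frozen. The sketch below shows those two generic building blocks under that reading; it is not the eTT implementation:

```python
import torch
import torch.nn as nn

class PrefixAttention(nn.Module):
    """Self-attention with a small set of learnable prefix key/value tokens."""

    def __init__(self, dim=384, num_heads=6, prefix_len=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.prefix = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)

    def forward(self, x):                       # x: (B, N, D) token sequence
        prefix = self.prefix.expand(x.shape[0], -1, -1)
        kv = torch.cat([prefix, x], dim=1)      # keys/values see the learned prefix
        out, _ = self.attn(x, kv, kv)
        return out

class ResidualAdapter(nn.Module):
    """Bottleneck adapter added residually after a (frozen) transformer block."""

    def __init__(self, dim=384, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```

During few-shot adaptation, only the prefix parameters and adapter weights would be updated.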
- Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning [0.0]
We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
arXiv Detail & Related papers (2022-09-22T10:18:59Z)
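Temporal order verification, highlighted above as the strongest pretext task, can be sketched as a binary classification over short frame sequences whose order is sometimes shuffled. The encoder interface, sequence length, and head design below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TemporalOrderVerifier(nn.Module):
    """Predict whether a short sequence of frame embeddings is in the right order."""

    def __init__(self, encoder, embed_dim=384, seq_len=3):
        super().__init__()
        self.encoder = encoder                       # e.g., a ViT returning (B, D) per frame
        self.head = nn.Linear(embed_dim * seq_len, 2)

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))   # (B*T, D)
        feats = feats.view(b, t * feats.shape[-1])   # concatenate along time
        return self.head(feats)                      # logits: ordered vs. shuffled

def make_labels_and_inputs(frames):
    """Shuffle half of the batch in time (in place); 1 = ordered, 0 = shuffled."""
    b, t = frames.shape[:2]
    labels = torch.randint(0, 2, (b,))
    for i in range(b):
        if labels[i] == 0:
            frames[i] = frames[i][torch.randperm(t)]
    return frames, labels
```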
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
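The SelfPatch pretext task is described only loosely here; one simplified reading is that each patch representation is pulled toward an aggregate of its most similar neighboring patches, used as a detached target. The sketch below follows that reading and is not the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def patch_neighbor_loss(patch_tokens, grid=14, k=3):
    """Pull each patch toward the mean of its k most similar adjacent patches.

    patch_tokens: (B, grid*grid, D) patch features from a ViT encoder.
    """
    b, n, d = patch_tokens.shape
    feats = F.normalize(patch_tokens, dim=-1)
    sim = feats @ feats.transpose(1, 2)                      # (B, N, N) cosine similarity

    # Adjacency mask over the patch grid (8-connected neighborhood, no self-loops).
    idx = torch.arange(n)
    rows, cols = idx // grid, idx % grid
    adjacent = (abs(rows[:, None] - rows[None, :]) <= 1) & \
               (abs(cols[:, None] - cols[None, :]) <= 1) & \
               (idx[:, None] != idx[None, :])

    sim = sim.masked_fill(~adjacent, float("-inf"))
    topk = sim.topk(k, dim=-1).indices                       # (B, N, k) neighbor indices

    # Target = mean of the selected neighbors, used as a fixed (detached) target.
    neighbors = torch.stack([feats[i][topk[i]] for i in range(b)])   # (B, N, k, D)
    target = neighbors.mean(dim=2).detach()
    return (1.0 - F.cosine_similarity(feats, target, dim=-1)).mean()
```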
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple yet effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
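The patch-relation tasks are not spelled out in this summary; a representative example of such a task is predicting the relative grid offset between pairs of patch tokens, which also gives every output token a training signal. The sketch below is one possible instantiation, not RelViT itself:

```python
import torch
import torch.nn as nn

class RelativePositionTask(nn.Module):
    """Classify the relative grid offset between pairs of patch tokens."""

    def __init__(self, dim=384, grid=14):
        super().__init__()
        self.grid = grid
        # Offsets range over (2*grid - 1) values per axis.
        self.head = nn.Linear(2 * dim, (2 * grid - 1) ** 2)

    def forward(self, patch_tokens, num_pairs=64):
        # patch_tokens: (B, grid*grid, D) output tokens of the transformer encoder
        b, n, d = patch_tokens.shape
        i = torch.randint(0, n, (b, num_pairs))
        j = torch.randint(0, n, (b, num_pairs))
        pairs = torch.cat([
            torch.gather(patch_tokens, 1, i.unsqueeze(-1).expand(-1, -1, d)),
            torch.gather(patch_tokens, 1, j.unsqueeze(-1).expand(-1, -1, d)),
        ], dim=-1)                                            # (B, num_pairs, 2D)
        logits = self.head(pairs)

        # Ground-truth relative offsets, encoded as a single class index.
        dr = i // self.grid - j // self.grid + self.grid - 1
        dc = i % self.grid - j % self.grid + self.grid - 1
        labels = dr * (2 * self.grid - 1) + dc
        return nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
```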
- The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy [111.49944789602884]
This paper systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space.
We propose corresponding regularizers that encourage representation diversity and coverage at each of these levels, enabling the capture of more discriminative information.
arXiv Detail & Related papers (2022-03-12T04:48:12Z)
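The regularizers themselves are not detailed in this snippet. A common way to penalize redundancy, for example at the patch-embedding level, is to discourage high pairwise cosine similarity among token representations; the function below is a generic illustration of that kind of diversity penalty, not the paper's exact regularizers:

```python
import torch
import torch.nn.functional as F

def diversity_penalty(tokens):
    """Penalize average pairwise cosine similarity among patch embeddings.

    tokens: (B, N, D) patch embeddings (the same idea can be applied to rows of
    an attention map or to weight matrices).
    """
    feats = F.normalize(tokens, dim=-1)
    sim = feats @ feats.transpose(1, 2)                         # (B, N, N)
    n = sim.shape[-1]
    off_diag = sim - torch.eye(n, device=sim.device) * sim      # zero out self-similarity
    return off_diag.abs().sum() / (sim.shape[0] * n * (n - 1))  # mean over off-diagonal pairs
```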
- Refiner: Refining Self-attention for Vision Transformers [85.80887884154427]
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs.
We introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs.
Refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention.
arXiv Detail & Related papers (2021-06-07T15:24:54Z)
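The "convolutions on attention maps" idea above can be sketched by treating each head's N x N attention map as a one-channel image and applying a small learnable kernel before the usual value aggregation. This is a loose illustration of the description, not the refiner module itself:

```python
import torch
import torch.nn as nn

class RefinedAttention(nn.Module):
    """Self-attention whose attention maps are locally augmented by a 3x3 conv."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One 3x3 kernel per head, applied to the (N, N) attention map.
        self.refine = nn.Conv2d(num_heads, num_heads, kernel_size=3,
                                padding=1, groups=num_heads)

    def forward(self, x):                                  # x: (B, N, D)
        b, n, d = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).reshape(b, n, 3, h, d // h).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        attn = self.refine(attn)                           # convolve local attention patterns
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```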
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)