Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning
- URL: http://arxiv.org/abs/2205.09995v1
- Date: Fri, 20 May 2022 07:25:33 GMT
- Title: Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning
- Authors: Yuzhong Chen, Zhenxiang Xiao, Lin Zhao, Lu Zhang, Haixing Dai, David
Weizhong Liu, Zihao Wu, Changhe Li, Tuo Zhang, Changying Li, Dajiang Zhu,
Tianming Liu, Xi Jiang
- Abstract summary: We propose a novel mask-guided vision transformer (MG-ViT) to achieve effective and efficient few-shot learning on the vision transformer (ViT) model.
The MG-ViT model significantly improves performance compared with general fine-tuning based ViT models.
- Score: 10.29251906347605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning with little data is challenging but often inevitable in various
application scenarios where the labeled data is limited and costly. Recently,
few-shot learning (FSL) has gained increasing attention because of its ability to
generalize prior knowledge to new tasks that contain only a few samples. However,
for data-intensive models such as the vision transformer (ViT), current
fine-tuning based FSL approaches are inefficient at knowledge generalization and
thus degrade downstream task performance. In this paper, we propose a novel
mask-guided vision transformer (MG-ViT) to achieve effective and efficient FSL on
the ViT model. The key idea is to apply a mask on
image patches to screen out the task-irrelevant ones and to guide the ViT to
focus on task-relevant and discriminative patches during FSL. Particularly,
MG-ViT only introduces an additional mask operation and a residual connection,
enabling the inheritance of parameters from pre-trained ViT without any other
cost. To optimally select representative few-shot samples, we also include an
active learning based sample selection method to further improve the
generalizability of MG-ViT based FSL. We evaluate the proposed MG-ViT on both
Agri-ImageNet classification task and ACFR apple detection task with
gradient-weighted class activation mapping (Grad-CAM) as the mask. The
experimental results show that the MG-ViT model significantly improves performance
compared with general fine-tuning based ViT models, providing
novel insights and a concrete approach towards generalizing data-intensive and
large-scale deep learning models for FSL.
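The abstract only outlines the mechanism, so below is a minimal PyTorch sketch of the stated idea: a precomputed Grad-CAM saliency map is pooled into per-patch scores, low-scoring (task-irrelevant) patch tokens are masked out, and a residual connection adds the unmasked tokens back so pre-trained ViT parameters can be inherited without architectural changes. The names (saliency_to_patch_mask, MaskGuidedEncoder), the keep_ratio threshold, and the exact placement of the mask and residual connection are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def saliency_to_patch_mask(saliency, patch_size, keep_ratio=0.5):
    """Convert a (B, H, W) Grad-CAM-style saliency map into a boolean (B, N) patch mask.

    Each patch is scored by its mean saliency; roughly the top `keep_ratio` fraction
    of patches is kept (True) and the rest is screened out (False).
    """
    scores = F.avg_pool2d(saliency.unsqueeze(1), patch_size).flatten(1)  # (B, N)
    k = max(1, int(keep_ratio * scores.shape[1]))
    thresh = scores.topk(k, dim=1).values[:, -1:]  # k-th largest score per sample
    return scores >= thresh                        # (B, N) boolean mask


class MaskGuidedEncoder(nn.Module):
    """Wraps a ViT-style patch embedding and transformer blocks with only a mask
    operation and a residual connection, so pre-trained weights could be loaded
    into `patch_embed` and `blocks` unchanged (stand-in modules used here)."""

    def __init__(self, embed_dim=192, patch_size=16, in_chans=3, depth=2, num_heads=3):
        super().__init__()
        self.patch_size = patch_size
        # Stand-ins for the pre-trained modules (randomly initialized in this sketch).
        self.patch_embed = nn.Conv2d(in_chans, embed_dim, patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images, saliency):
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        mask = saliency_to_patch_mask(saliency, self.patch_size)      # (B, N)
        masked = tokens * mask.unsqueeze(-1)   # zero out task-irrelevant patch tokens
        # Residual connection: the unmasked tokens are added back (illustrative placement).
        return self.blocks(masked + tokens)


# Toy usage: 224x224 images, 16x16 patches -> 196 patch tokens of width 192.
images = torch.randn(2, 3, 224, 224)
cam = torch.rand(2, 224, 224)                # pretend Grad-CAM saliency map
features = MaskGuidedEncoder()(images, cam)  # shape: (2, 196, 192)
```

Zeroing out tokens rather than dropping them keeps the sequence length fixed, which is one simple way to reuse the pre-trained positional structure; the class token, positional embeddings, and the paper's active-learning sample selection are omitted here for brevity.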
Related papers
- Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT)
We demonstrate that our new approximations with semantic information yield superior representational capability.
arXiv Detail & Related papers (2024-02-04T04:42:05Z)
- p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models [10.713680139939354]
Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks.
PETL has garnered attention as a viable alternative to full fine-tuning.
We propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs)
arXiv Detail & Related papers (2023-12-17T05:30:35Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Supervised Masked Knowledge Distillation for Few-Shot Transformers [36.46755346410219]
We propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers.
Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens.
Our method, with its simple design, outperforms previous methods by a large margin and achieves a new state-of-the-art.
arXiv Detail & Related papers (2023-03-25T03:31:46Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA)
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes)
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, and it significantly outperforms previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple but still effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs)
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs)
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Benchmarking Detection Transfer Learning with Vision Transformers [60.97703494764904]
The complexity of object detection methods can make benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive.
We present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN.
Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO.
arXiv Detail & Related papers (2021-11-22T18:59:15Z)