MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification
- URL: http://arxiv.org/abs/2309.09276v1
- Date: Sun, 17 Sep 2023 13:51:05 GMT
- Title: MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification
- Authors: Junjie Zhu, Yiying Li, Chunping Qiu, Ke Yang, Naiyang Guan, Xiaodong Yi
- Abstract summary: PMF has achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models.
We propose the Meta Visual Prompt Tuning (MVP) method, which updates only the newly added prompt parameters while keeping the pre-trained backbone frozen.
We introduce a novel data augmentation strategy based on patch embedding recombination to enhance the representation and diversity of scenes for classification purposes.
- Score: 15.780372479483235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) models have recently emerged as powerful and
versatile models for various visual tasks. A recent work, PMF, achieved
promising results in few-shot image classification by utilizing
pre-trained vision transformer models. However, PMF employs full fine-tuning
for learning the downstream tasks, leading to significant overfitting and
storage issues, especially in the remote sensing domain. In order to tackle
these issues, we turn to the recently proposed parameter-efficient tuning
methods, such as VPT, which updates only the newly added prompt parameters
while keeping the pre-trained backbone frozen. Inspired by VPT, we propose the
Meta Visual Prompt Tuning (MVP) method. Specifically, we integrate the VPT
method into the meta-learning framework and tailor it to the remote sensing
domain, resulting in an efficient framework for Few-Shot Remote Sensing Scene
Classification (FS-RSSC). Furthermore, we introduce a novel data augmentation
strategy based on patch embedding recombination to enhance the representation
and diversity of scenes for classification purposes. Experimental results on the
FS-RSSC benchmark demonstrate the superior performance of the proposed MVP over
existing methods in various settings, such as various-way-various-shot,
various-way-one-shot, and cross-domain adaptation.
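To make the frozen-backbone idea concrete, here is a minimal PyTorch sketch of VPT-style shallow prompt tuning as the abstract describes it: the pre-trained ViT stays frozen and only newly added prompt tokens (plus a small head) receive gradients. The class name, `num_prompts`, and the assumption that the encoder consumes token sequences are illustrative, not the authors' implementation.

```python
# Hedged sketch of VPT-style prompt tuning with a frozen ViT backbone.
# Assumes `backbone` is a pre-trained transformer encoder that maps a
# (B, N, D) token sequence to a (B, N, D) token sequence.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int = 768,
                 num_prompts: int = 10, num_classes: int = 5):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # keep the pre-trained weights frozen
        # The only new trainable parameters: prompt tokens and a linear head.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) embeddings with the [CLS] token at position 0.
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        x = torch.cat([tokens[:, :1], prompts, tokens[:, 1:]], dim=1)
        x = self.backbone(x)                 # frozen transformer blocks
        return self.head(x[:, 0])            # classify from the [CLS] output
```

Under the meta-learning framework the abstract describes, this same small set of prompt and head parameters is what would be adapted across episodes, which is what keeps per-task storage low.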
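The patch embedding recombination augmentation is described only at a high level in the abstract; one plausible reading, sketched below under that assumption, is to splice a random subset of patch embeddings from one same-class image into another, creating a new embedding sequence without any new raw imagery. The function name and `swap_ratio` are hypothetical.

```python
# Hedged sketch of patch-embedding recombination between two same-class
# images; the exact recombination rule in the MVP paper may differ.
import torch

def recombine_patch_embeddings(a: torch.Tensor, b: torch.Tensor,
                               swap_ratio: float = 0.3) -> torch.Tensor:
    """a, b: (N, D) patch embeddings of two images from the same class."""
    n = a.size(0)
    k = max(1, int(n * swap_ratio))
    idx = torch.randperm(n)[:k]   # patch positions to take from the partner
    mixed = a.clone()
    mixed[idx] = b[idx]           # splice b's patches into a copy of a
    return mixed
```

An episode could then mix original and recombined sequences in the support set, which is one way such an augmentation could increase scene diversity for few-shot classification.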
Related papers
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z) - FLIP: Cross-domain Face Anti-spoofing with Language Guidance [19.957293190322332]
Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems.
Recent vision transformer (ViT) models have been shown to be effective for the FAS task.
We propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language.
arXiv Detail & Related papers (2023-09-28T17:53:20Z) - Dynamic Visual Prompt Tuning for Parameter Efficient Transfer Learning [0.8430481660019451]
We propose a Dynamic Visual Prompt Tuning framework (DVPT), which can generate a dynamic instance-wise token for each image.
In this way, it can capture the unique visual feature of each image, which can be more suitable for downstream visual tasks.
Experiments on a wide range of downstream recognition tasks show that DVPT achieves superior performance compared to other PETL methods.
arXiv Detail & Related papers (2023-09-12T10:47:37Z) - M2Former: Multi-Scale Patch Selection for Fine-Grained Visual
Recognition [4.621578854541836]
We propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models.
Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT).
In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions.
arXiv Detail & Related papers (2023-08-04T06:41:35Z) - Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [58.70209492842953]
In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
arXiv Detail & Related papers (2023-04-29T08:59:12Z) - Strong Baselines for Parameter Efficient Few-Shot Fine-tuning [50.83426196335385]
Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase.
Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC.
Fine-tuning ViTs, however, is expensive in time, compute and storage.
This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters.
arXiv Detail & Related papers (2023-04-04T16:14:39Z) - Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z) - Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z) - Adaptive Transformers for Robust Few-shot Cross-domain Face
Anti-spoofing [71.06718651013965]
We present adaptive vision transformers (ViT) for robust cross-domain face anti-spoofing.
We adopt ViT as a backbone to exploit its strength to account for long-range dependencies among pixels.
Experiments on several benchmark datasets show that the proposed models achieve both robust and competitive performance.
arXiv Detail & Related papers (2022-03-23T03:37:44Z)