Exploring Efficient Few-shot Adaptation for Vision Transformers
- URL: http://arxiv.org/abs/2301.02419v1
- Date: Fri, 6 Jan 2023 08:42:05 GMT
- Title: Exploring Efficient Few-shot Adaptation for Vision Transformers
- Authors: Chengming Xu, Siqian Yang, Yabiao Wang, Zhanxiong Wang, Yanwei Fu,
Xiangyang Xue
- Abstract summary: We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs on Few-shot Learning tasks.
The key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
- Score: 70.91692521825405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of Few-shot Learning (FSL) aims to perform inference on
novel categories containing only a few labeled examples, with the help of
knowledge learned from base categories containing abundant labeled training
samples. While there are numerous works on the FSL task, Vision Transformers
(ViTs) have rarely been taken as the backbone for FSL, with the few existing
trials focusing on naive finetuning of the whole backbone or the
classification layer. Essentially, although ViTs have been shown to enjoy
comparable or even better performance on other vision tasks, it is still
highly nontrivial to efficiently finetune ViTs in real-world FSL scenarios.
To this end, we propose a novel efficient Transformer Tuning (eTT) method
that facilitates finetuning ViTs on FSL tasks. The key novelties come from
the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter
(DRA) for task and backbone tuning, respectively. Specifically, in APT, the
prefix is projected to new key and value pairs that are attached to each
self-attention layer to provide the model with task-specific information.
Moreover, we design the DRA in the form of learnable offset vectors to handle
the potential domain gaps between base and novel data. To ensure that the APT
does not deviate much from the initial task-specific information, we further
propose a novel prototypical regularization, which maximizes the similarity
between the projected distribution of the prefix and the initial prototypes,
regularizing the update procedure. Our method achieves outstanding
performance on the challenging Meta-Dataset. We conduct extensive experiments
to show the efficacy of our model.
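A minimal sketch of how the described components could be wired, assuming a
standard PyTorch ViT backbone: a learnable prefix is projected into extra
key/value pairs prepended to each self-attention layer (APT), a learnable
offset vector is added to token features to absorb domain shift (DRA), and a
cosine-similarity loss keeps the projected prefix close to the initial
prototypes. Module names, tensor shapes, and the single-head attention are
illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of APT, DRA, and a simplified prototypical regularization.
# Shapes and wiring are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentivePrefix(nn.Module):
    """Learnable prefix projected to key/value pairs for one self-attention layer."""

    def __init__(self, num_prefix: int, dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_prefix, dim) * 0.02)
        self.to_kv = nn.Linear(dim, 2 * dim)  # project prefix -> (key, value)

    def forward(self):
        k, v = self.to_kv(self.prefix).chunk(2, dim=-1)
        return k, v  # each of shape (num_prefix, dim)


class DomainResidualAdapter(nn.Module):
    """Learnable offset vector added to features to absorb base/novel domain gaps."""

    def __init__(self, dim: int):
        super().__init__()
        self.offset = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return x + self.offset


def prefixed_attention(q, k, v, prefix_k, prefix_v):
    """Scaled dot-product attention (single head) with prefix keys/values prepended."""
    b = q.shape[0]
    k = torch.cat([prefix_k.expand(b, -1, -1), k], dim=1)
    v = torch.cat([prefix_v.expand(b, -1, -1), v], dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v


def prototypical_regularization(projected_prefix, prototypes):
    """Loss pulling the projected prefix toward the initial class prototypes.

    Assumes one prefix vector per class prototype.
    """
    return 1.0 - F.cosine_similarity(projected_prefix, prototypes, dim=-1).mean()
```

In such a setup, only the prefix, its key/value projection, and the DRA
offsets would typically be updated per task while the pretrained ViT weights
stay frozen, which is what keeps the adaptation parameter-efficient.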
Related papers
- Heterogeneous Federated Learning with Splited Language Model [22.65325348176366]
Federated Split Learning (FSL) is a promising distributed learning paradigm in practice.
In this paper, we harness Pre-trained Image Transformers (PITs) as the initial model, coined FedV, to accelerate the training process and improve model robustness.
We are the first to provide a systematic evaluation of FSL methods with PITs in real-world datasets, different partial device participations, and heterogeneous data splits.
arXiv Detail & Related papers (2024-03-24T07:33:08Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Strong Baselines for Parameter Efficient Few-Shot Fine-tuning [50.83426196335385]
Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase.
Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC.
Fine-tuning ViTs, however, is expensive in time, compute and storage.
This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters.
arXiv Detail & Related papers (2023-04-04T16:14:39Z) - Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, significantly outperforming previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z) - Where are my Neighbors? Exploiting Patches Relations in Self-Supervised
Vision Transformer [3.158346511479111]
We propose a simple but still effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z) - Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning [10.29251906347605]
We propose a novel mask-guided vision transformer (MG-ViT) to achieve an effective and efficient few-shot learning on vision transformer (ViT) model.
The MG-ViT model significantly improves the performance when compared with general fine-tuning based ViT models.
arXiv Detail & Related papers (2022-05-20T07:25:33Z) - Exploring Complementary Strengths of Invariant and Equivariant
Representations for Few-Shot Learning [96.75889543560497]
In many real-world problems, collecting a large number of labeled samples is infeasible.
Few-shot learning is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in presence of a limited number of samples.
We propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations.
arXiv Detail & Related papers (2021-03-01T21:14:33Z) - TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot
classification [50.358839666165764]
We show that the Task-Adaptive Feature Sub-Space Learning (TAFSSL) can significantly boost the performance in Few-Shot Learning scenarios.
Specifically, we show that on the challenging miniImageNet and tieredImageNet benchmarks, TAFSSL can improve the current state-of-the-art in both transductive and semi-supervised FSL settings by more than 5%.
arXiv Detail & Related papers (2020-03-14T16:59:17Z)