Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning
- URL: http://arxiv.org/abs/2304.12520v3
- Date: Mon, 26 Jun 2023 06:01:14 GMT
- Title: Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning
- Authors: Zhongzhi Yu, Shang Wu, Yonggan Fu, Shunyao Zhang, Yingyan Lin
- Abstract summary: We propose a framework called Hint-based Data Augmentation (Hint-Aug).
It aims to boost foundation vision transformers (FViTs) in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs.
Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness.
- Score: 22.0296008705388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the growing demand for tuning foundation vision transformers (FViTs)
on downstream tasks, fully unleashing FViTs' potential under data-limited
scenarios (e.g., few-shot tuning) remains a challenge due to FViTs' data-hungry
nature. Common data augmentation techniques fall short in this context due to
the limited features contained in the few-shot tuning data. To tackle this
challenge, we first identify an opportunity for FViTs in few-shot tuning:
pretrained FViTs themselves have already learned highly representative features
from large-scale pretraining data, which are fully preserved during widely used
parameter-efficient tuning. We thus hypothesize that leveraging those learned
features to augment the tuning data can boost the effectiveness of few-shot
FViT tuning. To this end, we propose a framework called Hint-based Data
Augmentation (Hint-Aug), which aims to boost FViT in few-shot tuning by
augmenting the over-fitted parts of tuning samples with the learned features of
pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an
Attentive Over-fitting Detector (AOD) to detect over-confident patches of
foundation ViTs for potentially alleviating their over-fitting on the few-shot
tuning data and (2) a Confusion-based Feature Infusion (CFI) module to infuse
easy-to-confuse features from the pretrained FViTs with the over-confident
patches detected by the above AOD in order to enhance the feature diversity
during tuning. Extensive experiments and ablation studies on five datasets and
three parameter-efficient tuning techniques consistently validate Hint-Aug's
effectiveness: 0.04% ~ 32.91% higher accuracy over the state-of-the-art (SOTA)
data augmentation method under various low-shot settings. For example, on the
Pet dataset, Hint-Aug achieves a 2.22% higher accuracy with 50% less training
data over SOTA data augmentation methods.
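To make the two enablers concrete, the following is a rough, single-batch sketch of the idea in PyTorch. It is not the authors' implementation: patch over-confidence is approximated with a gradient-saliency score (a stand-in for AOD's attention-based detector), the infusion target is simply the highest-scoring wrong class (a stand-in for CFI), and the model interface (a callable returning logits) plus every hyperparameter are assumptions.
```python
# Rough sketch only: gradient saliency stands in for AOD's attention-based
# detector; a targeted nudge toward the most-confused class stands in for CFI.
import torch
import torch.nn.functional as F

def hint_aug_batch(model, images, labels, topk_patches=16, step=0.03, steps=3, patch=16):
    """Augment the most over-confident patches toward an easy-to-confuse class."""
    model.eval()
    images = images.clone().requires_grad_(True)
    logits = model(images)                                  # assumed: model returns logits
    # Easy-to-confuse target: the highest-scoring wrong class per sample.
    confused = logits.scatter(1, labels[:, None], float("-inf")).argmax(dim=1)
    # Over-confidence proxy: per-patch saliency of the task loss w.r.t. the input.
    loss = F.cross_entropy(logits, labels)
    grads = torch.autograd.grad(loss, images)[0].abs()
    B, C, H, W = images.shape
    per_patch = grads.sum(1).unfold(1, patch, patch).unfold(2, patch, patch).sum(dim=(-1, -2))
    idx = per_patch.flatten(1).topk(topk_patches, dim=1).indices
    mask = torch.zeros(B, per_patch[0].numel(), device=images.device).scatter_(1, idx, 1.0)
    mask = F.interpolate(mask.view(B, 1, H // patch, W // patch), size=(H, W), mode="nearest")
    # Feature infusion: nudge only the selected patches toward the confused class.
    x = images.detach().clone()
    for _ in range(steps):
        x.requires_grad_(True)
        g = torch.autograd.grad(F.cross_entropy(model(x), confused), x)[0]
        x = (x - step * g.sign() * mask).detach()
    return x
```
The returned augmented batch would then be mixed with the original few-shot samples during parameter-efficient tuning.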
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
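As a rough illustration of the Fourier-prompt idea described above (not the VFPT code), the module below inserts learnable prompt tokens and passes a subset of them through a 2D FFT before prepending them to the patch tokens; the prompt length, embedding size, and Fourier ratio are illustrative assumptions.
```python
# Sketch of Fourier-transformed prompt tokens; sizes are illustrative.
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    def __init__(self, num_prompts=10, dim=768, fourier_ratio=0.5):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.num_fourier = int(num_prompts * fourier_ratio)

    def forward(self, patch_tokens):                    # patch_tokens: (B, N, D)
        B = patch_tokens.size(0)
        p = self.prompts.unsqueeze(0).expand(B, -1, -1).contiguous()
        # 2D FFT over the (token, channel) plane for a subset of prompts;
        # keeping the real part mixes spatial and frequency-domain information.
        pf = torch.fft.fft2(p[:, :self.num_fourier]).real
        p = torch.cat([pf, p[:, self.num_fourier:]], dim=1)
        return torch.cat([p, patch_tokens], dim=1)      # prepend prompts to patch tokens
```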
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pretrained vision model (PVM) finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
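A minimal sketch of that second stage: a lightweight transformer block trained to map raw ViT features to the stage-one "clean" estimates. The module and loss names, sizes, and the single-layer design are illustrative assumptions, not DVT's actual architecture.
```python
# Illustrative stage-two denoiser; names, sizes, and the single layer are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDenoiser(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)

    def forward(self, raw_feats):              # raw_feats: (B, N, D) noisy ViT outputs
        return self.block(raw_feats)           # predicted artifact-free features

def denoiser_loss(denoiser, raw_feats, clean_feats):
    # clean_feats: per-image estimates from stage one (cross-view consistency fitting)
    return F.mse_loss(denoiser(raw_feats), clean_feats)
```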
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- Bridging Sensor Gaps via Attention Gated Tuning for Hyperspectral Image Classification [9.82907639745345]
Hyperspectral image (HSI) classification methods require high-quality labeled HSIs, which are often costly to obtain.
We propose a novel Attention-Gated Tuning (AGT) strategy and a triplet-structured transformer model, Tri-Former, to address this issue.
arXiv Detail & Related papers (2023-09-22T13:39:24Z)
- Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z)
- Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning [73.75282761503581]
We propose DiffTPT, which leverages pre-trained diffusion models to generate diverse and informative new data.
Our experiments on test datasets with distribution shifts and unseen categories demonstrate that DiffTPT improves the zero-shot accuracy by an average of 5.13%.
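A simplified sketch of this recipe: generate diffusion-based variants of a test image, keep the most confident views, and tune the prompt by minimizing the entropy of their averaged prediction. `generate_variants` and `clip_like_model` are hypothetical stand-ins, and the real DiffTPT additionally filters generated views by cosine similarity to the original image.
```python
# Simplified test-time prompt-tuning loop on diffusion-generated views;
# `generate_variants` and `clip_like_model` are hypothetical helpers.
import torch

def test_time_prompt_tuning(clip_like_model, prompt_embeds, image, generate_variants,
                            n_views=8, lr=5e-3, steps=1):
    # One original view plus n diffusion-generated variants of the same image.
    views = torch.stack([image] + [generate_variants(image) for _ in range(n_views)])
    prompt_embeds = prompt_embeds.detach().clone().requires_grad_(True)
    opt = torch.optim.AdamW([prompt_embeds], lr=lr)
    for _ in range(steps):
        logits = clip_like_model(views, prompt_embeds)    # (n_views + 1, num_classes)
        probs = logits.softmax(dim=-1)
        # Keep the most confident half of the views (lowest entropy) ...
        ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
        keep = ent.topk(max(1, len(ent) // 2), largest=False).indices
        # ... and minimize the entropy of their averaged prediction.
        avg = probs[keep].mean(dim=0)
        loss = -(avg * avg.clamp_min(1e-8).log()).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return prompt_embeds.detach()
```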
arXiv Detail & Related papers (2023-08-11T09:36:31Z)
- Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning [91.5113227694443]
We propose a novel Sensitivity-aware visual Parameter-efficient fine-Tuning (SPT) scheme.
SPT allocates trainable parameters to task-specific important positions.
Experiments on a wide range of downstream recognition tasks show that our SPT is complementary to the existing PEFT methods.
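As an illustration of sensitivity-driven allocation (not SPT's exact criterion), the sketch below scores each weight by the first-order term |w * grad| on a few task batches and keeps only the top-scoring fraction trainable via gradient masks; the scoring rule and budget are assumptions.
```python
# Illustrative sensitivity-based allocation of trainable positions.
import torch

def allocate_trainable(model, loss_fn, data_loader, budget_ratio=0.01, batches=4):
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    for i, (x, y) in enumerate(data_loader):
        if i >= batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if n in scores and p.grad is not None:
                scores[n] += (p.detach() * p.grad).abs()   # first-order sensitivity
    all_scores = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(budget_ratio * all_scores.numel()))
    threshold = all_scores.topk(k).values.min()
    # 1.0 marks a trainable position; during tuning, apply p.grad.mul_(mask)
    # before each optimizer step so only the selected positions get updated.
    return {n: (s >= threshold).float() for n, s in scores.items()}
```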
arXiv Detail & Related papers (2023-03-15T12:34:24Z)
- FedSDG-FS: Efficient and Secure Feature Selection for Vertical Federated Learning [21.79965380400454]
Vertical Federated Learning (VFL) enables multiple data owners, each holding a different subset of features about largely overlapping sets of data samples, to jointly train a useful global model.
Feature selection (FS) is important to VFL. It remains an open research problem, as existing FS works designed for VFL either assume prior knowledge of the number of noisy features or prior knowledge of the post-training threshold of useful features.
We propose the Federated Dual-Gate based Feature Selection (FedSDG-FS) approach. It consists of a Gaussian dual-gate to efficiently approximate the probability of a feature being selected, with privacy
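For intuition, here is a single-party sketch of a Gaussian stochastic gate of the kind described above; the federated protocol, the second gate, and the privacy machinery are omitted, and the initialization and noise scale are assumptions.
```python
# Single-party Gaussian stochastic gate for feature selection (illustrative only).
import math
import torch
import torch.nn as nn

class GaussianGate(nn.Module):
    def __init__(self, num_features, sigma=0.5):
        super().__init__()
        self.mu = nn.Parameter(torch.full((num_features,), 0.5))
        self.sigma = sigma

    def forward(self, x):                              # x: (B, num_features)
        noise = torch.randn_like(self.mu) if self.training else 0.0
        gate = torch.clamp(self.mu + self.sigma * noise, 0.0, 1.0)
        return x * gate

    def open_probability(self):
        # P(gate > 0): a differentiable proxy for the expected number of
        # selected features, usable as a sparsity regularizer.
        return 0.5 * (1 + torch.erf(self.mu / (self.sigma * math.sqrt(2)))).sum()
```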
arXiv Detail & Related papers (2023-02-21T03:09:45Z)
- AU-Aware Vision Transformers for Biased Facial Expression Recognition [17.00557858587472]
We experimentally show that the naive joint training of multiple facial expression recognition (FER) datasets is harmful to the FER performance of individual datasets.
We propose a simple yet conceptually new framework, the AU-aware Vision Transformer (AU-ViT).
Our AU-ViT achieves state-of-the-art performance on three popular datasets, namely 91.10% on RAF-DB, 65.59% on AffectNet, and 90.15% on FERPlus.
arXiv Detail & Related papers (2022-11-12T08:58:54Z)
- CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models [101.5066760592534]
We present Cross-modal Prompt Tuning (CPT), a novel paradigm for tuning pre-trained vision-language models (VL-PTMs).
CPT reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap between pre-training and fine-tuning.
Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin.
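As a rough illustration of the color-marker idea (not CPT's pipeline), the snippet below paints candidate regions with distinct translucent colors and builds a fill-in-the-blank query; the color set, template wording, and helper names are illustrative assumptions.
```python
# Illustrative color-marker construction; colors and template are assumptions.
from PIL import Image, ImageDraw

COLORS = {"red": (240, 0, 30), "green": (0, 200, 30), "blue": (30, 60, 230)}

def mark_regions(image: Image.Image, boxes, alpha=0.5):
    """Overlay up to len(COLORS) candidate boxes with distinct translucent colors."""
    out = image.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    names = []
    for name, box in zip(COLORS, boxes):
        draw.rectangle(box, fill=COLORS[name] + (int(255 * alpha),))
        names.append(name)
    return Image.alpha_composite(out, overlay), names

def grounding_prompt(expression):
    # The pre-trained masked-LM head is asked to fill in the marker color,
    # turning grounding into a fill-in-the-blank problem.
    return f"{expression} is in [MASK] color."
```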
arXiv Detail & Related papers (2021-09-24T08:07:29Z)