Understanding and Improving Visual Prompting: A Label-Mapping Perspective
- URL: http://arxiv.org/abs/2211.11635v5
- Date: Fri, 24 Mar 2023 18:06:06 GMT
- Title: Understanding and Improving Visual Prompting: A Label-Mapping Perspective
- Authors: Aochuan Chen, Yuguang Yao, Pin-Yu Chen, Yihua Zhang, Sijia Liu
- Abstract summary: We revisit and advance visual prompting (VP), an input prompting technique for vision tasks.
We propose a new VP framework, termed ILM-VP, which automatically re-maps the source labels to the target labels.
Our proposal significantly outperforms state-of-the-art VP methods.
- Score: 63.89295305670113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We revisit and advance visual prompting (VP), an input prompting technique
for vision tasks. VP can reprogram a fixed, pre-trained source model to
accomplish downstream tasks in the target domain by simply incorporating
universal prompts (in terms of input perturbation patterns) into downstream
data points. Yet, it remains elusive why VP stays effective even given a
ruleless label mapping (LM) between the source classes and the target classes.
Inspired by the above, we ask: How is LM interrelated with VP? And how can such
a relationship be exploited to improve VP's accuracy on target tasks? We peer
into the influence of LM on VP and provide an affirmative answer that a better
'quality' of LM (assessed by mapping precision and explanation) can
consistently improve the effectiveness of VP. This is in contrast to the prior
art where the factor of LM was missing. To optimize LM, we propose a new VP
framework, termed ILM-VP (iterative label mapping-based visual prompting),
which automatically re-maps the source labels to the target labels and
progressively improves the target task accuracy of VP. Further, when using a
contrastive language-image pretrained (CLIP) model, we propose to integrate an
LM process to assist the text prompt selection of CLIP and to improve the
target task accuracy. Extensive experiments demonstrate that our proposal
significantly outperforms state-of-the-art VP methods. For example, we
show that when reprogramming an ImageNet-pretrained ResNet-18 to 13 target
tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and
6.7% accuracy improvements in transfer learning to the target Flowers102 and
CIFAR100 datasets. Besides, our proposal on CLIP-based VP provides 13.7% and
7.1% accuracy improvements on Flowers102 and DTD respectively. Our code is
available at https://github.com/OPTML-Group/ILM-VP.
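To make the described mechanism concrete, below is a minimal PyTorch sketch of the two ingredients the abstract names: a universal input perturbation pattern (the visual prompt) trained on top of a frozen ImageNet-pretrained ResNet-18, and a label mapping from source to target classes that is refreshed iteratively as the prompt improves. This is an illustrative reconstruction under assumptions (an additive prompt, a greedy confidence-based mapping rule, and helper names such as VisualPrompt and remap_labels), not the authors' implementation; see the repository above for the actual code.

```python
# Minimal sketch of iterative label mapping-based visual prompting (ILM-VP).
# Illustrative only: the prompt parameterization and the greedy mapping rule
# below are assumptions; the paper's method may differ in both.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class VisualPrompt(nn.Module):
    """Universal additive perturbation pattern applied to every input image."""

    def __init__(self, image_size: int = 224):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes inputs lie in [0, 1]; ImageNet normalization omitted for brevity.
        return torch.clamp(x + self.delta, 0.0, 1.0)


@torch.no_grad()
def remap_labels(model, prompt, loader, num_target_classes, device):
    """Greedily assign each target class the source class it activates most."""
    num_source_classes = 1000  # ImageNet-pretrained source model
    score = torch.zeros(num_target_classes, num_source_classes, device=device)
    for x, y in loader:
        probs = F.softmax(model(prompt(x.to(device))), dim=1)  # (B, 1000)
        score.index_add_(0, y.to(device), probs)               # accumulate per target class
    return score.argmax(dim=1)                                 # (num_target_classes,)


def train_ilm_vp(loader, num_target_classes, epochs=10, device="cpu"):
    model = resnet18(weights="IMAGENET1K_V1").to(device).eval()
    for p in model.parameters():                               # source model stays frozen
        p.requires_grad_(False)

    prompt = VisualPrompt().to(device)
    opt = torch.optim.Adam(prompt.parameters(), lr=0.01)

    mapping = remap_labels(model, prompt, loader, num_target_classes, device)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(prompt(x))[:, mapping]              # keep only mapped source logits
            loss = F.cross_entropy(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # "Iterative" part: refresh the label mapping under the updated prompt.
        mapping = remap_labels(model, prompt, loader, num_target_classes, device)
    return prompt, mapping
```

The loop alternates between training the prompt under the current mapping and re-deriving the mapping under the updated prompt, which is the iterative label mapping the abstract refers to; the greedy argmax here is a simplification and does not guarantee a one-to-one source-to-target assignment.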
Related papers
- OT-VP: Optimal Transport-guided Visual Prompting for Test-Time Adaptation [8.425690424016986]
Vision Transformers (ViTs) have demonstrated remarkable capabilities in learning representations, but their performance is compromised when applied to unseen domains.
Our approach, Optimal Transport-guided Test-Time Visual Prompting (OT-VP), addresses this problem by leveraging prompt learning at test time to align the target and source domains.
OT-VP, with only four learned prompt tokens, exceeds state-of-the-art performance across three stylistic datasets.
arXiv Detail & Related papers (2024-06-12T18:30:03Z)
- AutoVP: An Automated Visual Prompting Framework and Benchmark [66.5618543577204]
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks.
We propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with a benchmark of 12 downstream image-classification tasks.
Our experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin.
arXiv Detail & Related papers (2023-10-12T14:55:31Z)
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation [20.57370550156505]
ReCLIP is a source-free domain adaptation method for vision-language models.
We demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
arXiv Detail & Related papers (2023-08-04T18:11:40Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
- Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards the object detector at inference time whereas the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z)
- Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new, efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, which significantly outperforms previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.