Unleashing the Power of Visual Prompting At the Pixel Level
- URL: http://arxiv.org/abs/2212.10556v2
- Date: Wed, 29 Mar 2023 06:49:51 GMT
- Title: Unleashing the Power of Visual Prompting At the Pixel Level
- Authors: Junyang Wu, Xianhang Li, Chen Wei, Huiyu Wang, Alan Yuille, Yuyin
Zhou, Cihang Xie
- Abstract summary: We show that the strategy of reconciling the prompt and the image matters, and find that warping the prompt around a properly shrunken image empirically works the best.
Using a CLIP model, our prompting method sets a new record of 82.8% average accuracy across 12 popular classification datasets.
- Score: 28.50538386115006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a simple and effective visual prompting method for
adapting pre-trained models to downstream recognition tasks. Our method
includes two key designs. First, rather than directly adding together the
prompt and the image, we treat the prompt as an extra and independent learnable
component. We show that the strategy of reconciling the prompt and the image
matters, and find that warping the prompt around a properly shrunken image
empirically works the best. Second, we re-introduce two "old tricks" commonly
used in building transferable adversarial examples, i.e., input diversity and
gradient normalization, into visual prompting. These techniques improve
optimization and enable the prompt to generalize better. We provide extensive
experimental results to demonstrate the effectiveness of our method. Using a
CLIP model, our prompting method sets a new record of 82.8% average accuracy
across 12 popular classification datasets, substantially surpassing the prior
art by +5.6%. It is worth noting that this prompting performance already
outperforms linear probing by +2.1% and can even match full fine-tuning on
certain datasets. In addition, our prompting method shows competitive
performance across different data scales and against distribution shifts. The
code is publicly available at https://github.com/UCSC-VLAA/EVP.
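The two designs above translate directly into a short training loop. Below is a minimal, hedged sketch in PyTorch of a padding-style pixel prompt: the image is shrunk, the learnable prompt fills the border around it, the input is randomly rescaled for diversity, and the prompt gradient is L2-normalized before each step. The border width, augmentation, optimizer, and learning rate are illustrative assumptions rather than the authors' released configuration, and `frozen_model` stands in for a frozen CLIP image encoder with a text-derived classifier head; see the linked repository for the actual implementation.
```python
# Hedged sketch of a padding-style pixel prompt in the spirit of the paper above.
# NOT the official implementation; border width, augmentation, and optimizer are assumptions.
# Assumes the model, data, and prompt all live on the same device.
import torch
import torch.nn.functional as F

IMG = 224
PAD = 16                       # illustrative border width; the paper tunes the shrink ratio
INNER = IMG - 2 * PAD          # the image is shrunk so the prompt can wrap around it

# The prompt is an extra, independent learnable component (not simply added on top of pixels).
prompt = torch.zeros(1, 3, IMG, IMG, requires_grad=True)
optimizer = torch.optim.SGD([prompt], lr=0.1)

def apply_prompt(x):
    """Shrink a batch of images and warp the prompt around the border."""
    small = F.interpolate(x, size=(INNER, INNER), mode="bilinear", align_corners=False)
    framed = F.pad(small, (PAD, PAD, PAD, PAD), value=0.0)   # image sits in the center
    mask = torch.zeros_like(framed[:, :1])
    mask[..., PAD:-PAD, PAD:-PAD] = 1.0                      # 1 where the image lives
    return framed * mask + prompt * (1.0 - mask)             # prompt fills only the border

def train_step(frozen_model, x, y):
    # "Input diversity": an occasional random rescale, borrowed from the transferable
    # adversarial-example literature; the exact augmentation here is an assumption.
    if torch.rand(()).item() < 0.5:
        scale = int(IMG * 0.9)
        x = F.interpolate(x, size=(scale, scale), mode="bilinear", align_corners=False)
        x = F.interpolate(x, size=(IMG, IMG), mode="bilinear", align_corners=False)
    loss = F.cross_entropy(frozen_model(apply_prompt(x)), y)
    optimizer.zero_grad()
    loss.backward()
    # Gradient normalization: rescale the raw gradient to unit L2 norm before the step.
    with torch.no_grad():
        prompt.grad /= prompt.grad.norm() + 1e-8
    optimizer.step()
    return loss.item()
```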
Related papers
- LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation [41.77434289193232]
We propose a novel visual prompt design that introduces Low-Rank matrix multiplication for Visual Prompting (LoR-VP).
LoR-VP enables shared and patch-specific information across the rows and columns of image pixels.
Experiments demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods (a sketch of the low-rank prompt follows this entry).
arXiv Detail & Related papers (2025-02-02T20:10:48Z)
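The low-rank construction in the LoR-VP entry above can be pictured in a few lines of PyTorch: the pixel-level prompt is the product of two small learnable factors, so parameters are shared along the rows and columns of the image. This is a hedged sketch under assumed shapes and rank, not the paper's released code, and it does not capture how LoR-VP ties the factors to patches.
```python
# Hedged sketch of a low-rank pixel prompt in the spirit of LoR-VP (not the official code).
import torch

C, H, W, R = 3, 224, 224, 4      # channels, resolution, and an illustrative rank

# Two small factors per channel; their product spans the whole image plane,
# sharing information along rows and columns while costing only C * R * (H + W)
# parameters instead of C * H * W for a dense prompt.
col_factor = torch.nn.Parameter(0.01 * torch.randn(C, H, R))
row_factor = torch.nn.Parameter(0.01 * torch.randn(C, R, W))

def low_rank_prompt():
    return torch.bmm(col_factor, row_factor)       # (C, H, R) @ (C, R, W) -> (C, H, W)

def prompted(x):
    """Additive prompting: add the rank-R pattern to every image in the batch."""
    return x + low_rank_prompt().unsqueeze(0)

# Only the two factors are trained; the backbone stays frozen.
optimizer = torch.optim.Adam([col_factor, row_factor], lr=1e-3)
```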
- Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting [55.361337202198925]
Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions.
We propose a label-free prompt distribution learning and bias correction framework, dubbed **Frolic**, which boosts zero-shot performance without the need for labeled data.
arXiv Detail & Related papers (2024-10-25T04:00:45Z)
- When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective [57.05315507519704]
We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing.
Our measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%.
arXiv Detail & Related papers (2024-09-03T12:03:45Z)
- COMMA: Co-Articulated Multi-Modal Learning [39.778958624066185]
We propose Co-Articulated Multi-Modal Learning (COMMA) to handle the limitations of previous methods.
Our method generates the prompts of each branch by taking the prompts of both branches into account, enhancing the alignment between their representations.
We evaluate our method on three representative tasks: generalization to novel classes, to new target datasets, and to unseen domain shifts.
arXiv Detail & Related papers (2023-12-30T15:47:36Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on the harmonic mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Diversity-Aware Meta Visual Prompting [111.75306320834629]
We present Diversity-Aware Meta Visual Prompting (DAM-VP), an efficient prompting method for transferring pre-trained models to downstream tasks with a frozen backbone.
We cluster the downstream dataset into small subsets in a diversity-aware way, with each subset having its own prompt optimized separately.
All of the prompts are optimized with the help of a meta-prompt learned across several datasets (a sketch of this recipe follows the entry).
arXiv Detail & Related papers (2023-03-14T17:59:59Z)
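A hedged sketch of the diversity-aware recipe in the DAM-VP entry above: cluster frozen-backbone features of the downstream data, give each cluster its own pixel prompt, and initialize every prompt from a meta-prompt learned on earlier datasets. The clustering method, prompt form, and routing at test time are assumptions for illustration, not the paper's exact design.
```python
# Hedged sketch in the spirit of DAM-VP (not the official implementation).
import torch
from sklearn.cluster import KMeans

K = 8                                     # illustrative number of clusters per dataset

def build_cluster_prompts(features, meta_prompt):
    """Cluster frozen-backbone features and give each cluster its own prompt,
    every one initialized from a meta-prompt learned across earlier datasets."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(features.cpu().numpy())
    prompts = torch.nn.ParameterList(
        [torch.nn.Parameter(meta_prompt.detach().clone()) for _ in range(K)]
    )
    return labels, prompts

def prompted(x, cluster_id, prompts):
    # Additive pixel prompt chosen by the image's cluster; at test time an image
    # would be routed to its nearest cluster center (an assumption about routing).
    return x + prompts[cluster_id]
```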
- Prompt Learning with Optimal Transport for Vision-Language Models [25.928455328563402]
We learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts.
To match these prompts to the visual features, we propose to apply optimal transport between the vision and text modalities.
In the inner loop, we optimize the optimal transport distance that aligns visual features and prompts via the Sinkhorn algorithm; in the outer loop, we learn the prompts from the supervised data using this distance (a Sinkhorn sketch follows this entry).
arXiv Detail & Related papers (2022-10-03T22:21:07Z)
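The inner/outer loop described in the optimal-transport entry above hinges on the Sinkhorn iteration. Below is a hedged, self-contained sketch of that inner loop, assuming M local visual features and N text features derived from N learned prompts of one class, both L2-normalized so the cost is one minus cosine similarity. It is not the paper's code, and the regularization strength and iteration count are illustrative.
```python
# Hedged sketch of the Sinkhorn inner loop for prompt-to-feature matching (not the paper's code).
import torch

def sinkhorn(cost, eps=0.1, iters=50):
    """Entropic-regularized optimal transport with uniform marginals; returns the plan."""
    M, N = cost.shape
    mu = torch.full((M,), 1.0 / M)
    nu = torch.full((N,), 1.0 / N)
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(iters):                        # alternating scaling updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)    # transport plan T = diag(u) K diag(v)

def ot_distance(visual_feats, prompt_feats):
    """Outer-loop objective: gradients reach the prompts through the cost matrix."""
    cost = 1.0 - visual_feats @ prompt_feats.t()  # (M, N) cosine cost
    with torch.no_grad():                         # inner loop: the plan itself is not differentiated
        plan = sinkhorn(cost)
    return (plan * cost).sum()
```
In full training, this distance (negated) would serve as the class score and the prompts would be updated from supervised data; that wiring is omitted here.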
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- Prompt Distribution Learning [46.46876752213575]
We present prompt distribution learning for adapting a pre-trained vision-language model to address downstream recognition tasks.
Our method not only learns low-bias prompts from a few samples but also captures the distribution of diverse prompts to handle the varying visual representations.
This prompt distribution learning is realized by an efficient approach that learns the output embeddings of prompts instead of the input embeddings (a simplified sketch follows this entry).
arXiv Detail & Related papers (2022-05-06T16:22:36Z)
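A deliberately simplified sketch of the idea in the entry above: rather than optimizing prompt tokens at the text-encoder input, learn several "output embeddings" per class directly and treat them as samples from a per-class distribution. The actual method estimates a Gaussian over these embeddings and optimizes a surrogate of the expected loss; this sketch merely averages the resulting logits, and all shapes below are assumptions.
```python
# Hedged, simplified sketch of prompt distribution learning at the output-embedding level.
import torch
import torch.nn.functional as F

K, C, D = 4, 100, 512                              # illustrative: K embedding sets, C classes, dim D
class_embeds = torch.nn.Parameter(0.02 * torch.randn(K, C, D))

def logits(image_feats, temperature=0.01):
    w = F.normalize(class_embeds, dim=-1)          # (K, C, D) learned output embeddings
    x = F.normalize(image_feats, dim=-1)           # (B, D) frozen image features
    per_set = torch.einsum("bd,kcd->bkc", x, w)    # cosine logits under each embedding set
    return per_set.mean(dim=1) / temperature       # aggregate over the K sets

def train_loss(image_feats, labels):
    return F.cross_entropy(logits(image_feats), labels)
```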
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.