LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for
Vision-Language Models
- URL: http://arxiv.org/abs/2309.01155v2
- Date: Fri, 22 Sep 2023 02:46:55 GMT
- Title: LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for
Vision-Language Models
- Authors: Cheng Shi and Sibei Yang
- Abstract summary: We show that synthetic text images are good visual prompts for vision-language models!
We propose LoGoPrompt, which reformulates the classification objective as visual prompt selection.
Our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.
- Score: 28.983503845298824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt engineering is a powerful tool used to enhance the performance of
pre-trained models on downstream tasks. For example, providing the prompt
"Let's think step by step" improved GPT-3's reasoning accuracy to 63% on
MultiArith, while the prompt "a photo of" filled with a class name enables CLIP
to achieve 80% zero-shot accuracy on ImageNet. While previous research has
explored prompt learning for the visual modality, analyzing what constitutes a
good visual prompt specifically for image recognition is limited. In addition,
the generalization ability of existing visual prompt tuning methods is worse
than that of text-only prompt tuning. This paper explores our key insight: synthetic text
images are good visual prompts for vision-language models! To achieve that, we
propose LoGoPrompt, which reformulates the classification objective as visual
prompt selection and addresses the chicken-and-egg challenge of whether to
first add synthetic text images as class-wise visual prompts or to predict the
class first. Without any trainable visual prompt parameters, experimental
results on 16 datasets demonstrate that our method consistently outperforms
state-of-the-art methods in few-shot learning, base-to-new generalization, and
domain generalization.
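
To make the key insight concrete, here is a minimal zero-shot-style sketch, assuming CLIP from the Hugging Face transformers library; the paste position, the white text strip, and the use of raw CLIP logits are illustrative assumptions, not the paper's actual training procedure:

import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def add_text_prompt(image: Image.Image, class_name: str) -> Image.Image:
    """Render the class name as a synthetic text image and paste it
    onto the input image as a class-wise visual prompt."""
    prompted = image.copy()
    draw = ImageDraw.Draw(prompted)
    draw.rectangle([0, 0, prompted.width, 24], fill="white")  # text strip
    draw.text((4, 4), class_name, fill="black")
    return prompted

def classify(image: Image.Image, class_names: list[str]) -> str:
    """Classification as visual prompt selection: try every class-wise
    prompt and pick the class whose prompted image best matches its own
    text description, sidestepping the chicken-and-egg problem."""
    prompted_images = [add_text_prompt(image, c) for c in class_names]
    texts = [f"a photo of a {c}" for c in class_names]
    inputs = processor(text=texts, images=prompted_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # [num_classes, num_classes]
    scores = logits.diagonal()  # score class i by its own (prompt, text) pair
    return class_names[scores.argmax().item()]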
Related papers
- Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We generate an attention heatmap for the input image, conditioned on the text query, using an auxiliary model such as CLIP.
Experiments on various vision-language benchmarks verify the effectiveness of our technique.
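A rough sketch of the heatmap step, assuming CLIP as the auxiliary scorer; the grid-crop scoring below is a crude stand-in for however the paper actually derives its map:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attention_heatmap(image: Image.Image, query: str, grid: int = 4) -> torch.Tensor:
    """Score each grid cell of the image against the text query and
    normalize the scores into a query-dependent attention map."""
    w, h = image.size
    cells = [image.crop((c * w // grid, r * h // grid,
                         (c + 1) * w // grid, (r + 1) * h // grid))
             for r in range(grid) for c in range(grid)]
    inputs = processor(text=[query], images=cells,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # [grid*grid]
    return torch.softmax(scores, dim=0).reshape(grid, grid)

The resulting map would then be overlaid on the input image before it is passed to the vision-language model.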
arXiv Detail & Related papers (2024-09-25T17:59:13Z)
- Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z)
- TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt [15.259819430801402]
We propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem.
Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge by the well-aligned space of pre-trained vision-language models.
Experimental results on the VOC2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method can surpass state-of-the-art (SOTA) methods.
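A hedged sketch of what a per-class pseudo-visual prompt might look like: one learnable vector per category in the joint embedding space, producing multi-label logits; the shapes and names are assumptions, not the paper's implementation:

import torch
import torch.nn as nn

class PseudoVisualPrompt(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 512):
        super().__init__()
        # one learnable prompt vector per category
        self.prompts = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # cosine similarity between image features and each class prompt
        img = image_features / image_features.norm(dim=-1, keepdim=True)
        cls = self.prompts / self.prompts.norm(dim=-1, keepdim=True)
        return img @ cls.t()  # [batch, num_classes] multi-label logits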
arXiv Detail & Related papers (2024-05-11T06:11:42Z)
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on the harmonic mean across eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Optimizing Prompts for Text-to-Image Generation [97.61295501273288]
Well-designed prompts can guide text-to-image models to generate amazing images.
But performant prompts are often model-specific and misaligned with user input.
We propose prompt adaptation, a framework that automatically adapts original user input to model-preferred prompts.
arXiv Detail & Related papers (2022-12-19T16:50:41Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
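A minimal sketch of the "texts as images" idea, assuming CLIP from the transformers library: because CLIP aligns the two modalities, text embeddings of class-describing sentences can stand in for image embeddings when tuning prompts; the prompt form and training loss are simplified away here:

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_as_image_features(sentences: list[str]) -> torch.Tensor:
    """Encode class-describing sentences with the text tower and use the
    result wherever image features would normally be used for tuning."""
    tokens = tokenizer(sentences, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

At test time, real image features from the vision tower would simply replace these text-derived stand-ins.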
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
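A speculative sketch of the dual-modality idea: learnable context vectors for the text encoder and learnable prompt tokens for the image encoder, optimized jointly; every shape and name below is an assumption, not DPT's actual interface:

import torch
import torch.nn as nn

class DualModalityPrompts(nn.Module):
    def __init__(self, n_text_ctx: int = 16, text_dim: int = 512,
                 n_vis_tokens: int = 8, vis_dim: int = 768):
        super().__init__()
        # context vectors prepended to the class-name token embeddings
        self.text_ctx = nn.Parameter(torch.randn(n_text_ctx, text_dim) * 0.02)
        # prompt tokens appended to the image patch sequence
        self.vis_tokens = nn.Parameter(torch.randn(n_vis_tokens, vis_dim) * 0.02)

# both prompt sets are updated by one optimizer while the backbone stays frozen
optimizer = torch.optim.SGD(DualModalityPrompts().parameters(), lr=2e-3)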
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.