Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
- URL: http://arxiv.org/abs/2312.06401v1
- Date: Mon, 11 Dec 2023 14:17:02 GMT
- Title: Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
- Authors: Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, Xiangyu Zhang
- Abstract summary: We propose Compound Text-Guided Prompt Tuning (TGP-T).
It significantly reduces resource demand while achieving superior performance.
It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet.
- Score: 42.248853198953945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable
generalization capabilities to downstream tasks. However, existing prompt-tuning-based
frameworks need to parallelize learnable textual inputs for all categories,
suffering from massive GPU memory consumption when the target dataset contains a
large number of categories. Moreover, previous works require category names to be
included within prompts, exhibiting subpar performance when dealing with ambiguous
category names. To address these shortcomings, we
propose Compound Text-Guided Prompt Tuning (TGP-T) that significantly reduces
resource demand while achieving superior performance. We introduce text
supervision to the optimization of prompts, which yields two benefits: 1) removing
the model's reliance on pre-defined category names during inference, thereby
enabling more flexible prompt generation; 2) reducing the number of inputs to the
text encoder, which decreases GPU memory consumption significantly. Specifically,
we found that compound text supervision, i.e., category-wise and content-wise, is
highly effective, since the two cues provide inter-class separability and capture
intra-class variations, respectively.
Moreover, we condition the prompt generation on visual features through a
module called Bonder, which facilitates the alignment between prompts and
visual features. Extensive experiments on few-shot recognition and domain
generalization demonstrate that TGP-T achieves superior performance with
consistently lower training costs. It reduces GPU memory usage by 93% and
attains a 2.5% performance gain on 16-shot ImageNet. The code is available at
https://github.com/EricTan7/TGP-T.
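As a rough illustration of the two mechanisms described above, namely conditioning prompt generation on visual features through the Bonder and supervising the generated prompts with compound (category-wise and content-wise) text embeddings, here is a minimal PyTorch-style sketch. It is not the released TGP-T implementation (see the repository linked above); the module layout, dimensions, prompt counts, and loss form are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bonder(nn.Module):
    """Illustrative cross-attention block: learnable prompt queries attend to
    visual features, so the generated prompts are image-adaptive."""

    def __init__(self, dim=512, num_prompts=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, visual_tokens):            # visual_tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        out, _ = self.attn(q, visual_tokens, visual_tokens)
        return out + self.ffn(out)                # (B, num_prompts, dim)


def compound_text_loss(cat_prompt_feat, con_prompt_feat,
                       cat_text_feat, con_text_feat):
    """Compound text supervision: pull the category-wise prompt feature toward a
    category-wise text embedding (inter-class separability) and the content-wise
    prompt feature toward a caption-like content embedding (intra-class
    variation). All inputs: (B, dim)."""
    loss_cat = 1.0 - F.cosine_similarity(cat_prompt_feat, cat_text_feat, dim=-1).mean()
    loss_con = 1.0 - F.cosine_similarity(con_prompt_feat, con_text_feat, dim=-1).mean()
    return loss_cat + loss_con


if __name__ == "__main__":
    bonder = Bonder()
    visual_tokens = torch.randn(2, 50, 512)       # e.g. CLIP ViT patch tokens
    prompts = bonder(visual_tokens)               # (2, 4, 512) image-adaptive prompts
    # In practice the prompts would be fed through the text encoder; here we
    # simply average them into two pseudo features to exercise the loss.
    cat_feat, con_feat = prompts[:, :2].mean(1), prompts[:, 2:].mean(1)
    loss = compound_text_loss(cat_feat, con_feat,
                              torch.randn(2, 512), torch.randn(2, 512))
    print(loss.item())
```

Since the prompts are matched against a small, fixed set of text targets rather than one learnable input per category name, the number of sequences fed to the text encoder stays independent of the category count, which is consistent with the memory savings reported in the abstract.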
Related papers
- SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories.
During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories (see the prompt-synthesis sketch after this list).
arXiv Detail & Related papers (2025-04-24T09:31:08Z)
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- Can Better Text Semantics in Prompt Tuning Improve VLM Generalization? [28.041879000565874]
We introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models.
Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts.
Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods.
arXiv Detail & Related papers (2024-05-13T16:52:17Z)
- Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach [29.735863112700358]
We study the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task.
Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories.
We introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data (see the attention-mask sketch after this list).
arXiv Detail & Related papers (2024-04-17T20:35:00Z)
- GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph [63.81641578763094]
Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs).
We propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge.
In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively.
arXiv Detail & Related papers (2023-09-24T12:56:40Z)
- PVPUFormer: Probabilistic Visual Prompt Unified Transformer for Interactive Image Segmentation [28.033243651780214]
This paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation.
We first propose a Probabilistic Prompt-unified Encoder (PPuE) to generate a unified one-dimensional vector by exploring both prompt and non-prompt information.
We then present a Prompt-to-Pixel Contrastive (P²C) loss to accurately align both prompt and pixel features, bridging the representation gap between them.
arXiv Detail & Related papers (2023-06-11T12:00:33Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
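Two of the related entries above describe mechanisms concrete enough to sketch. For SDVPT's dynamic synthesis of visual prompts for unseen categories (the prompt-synthesis sketch referenced above), one plausible reading is a similarity-weighted combination of the training-category prompts; the function below is an illustrative assumption, not SDVPT's code, and all names and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F


def synthesize_unseen_prompts(train_prompts, train_text_emb, unseen_text_emb,
                              temperature=0.07):
    """Combine training-category visual prompts, weighted by the semantic
    similarity between unseen and training category text embeddings.

    train_prompts:   (C, P, D) learned visual prompts for C training categories
    train_text_emb:  (C, D)    text embeddings of the training category names
    unseen_text_emb: (U, D)    text embeddings of the unseen category names
    returns:         (U, P, D) synthesized prompts for the unseen categories
    """
    sim = F.normalize(unseen_text_emb, dim=-1) @ F.normalize(train_text_emb, dim=-1).T
    weights = F.softmax(sim / temperature, dim=-1)            # (U, C)
    return torch.einsum("uc,cpd->upd", weights, train_prompts)
```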
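For the unidirectional causal attention between novel and base prompts in the generalized few-shot segmentation entry (the attention-mask sketch referenced above), one way to realize it is an attention mask that lets novel prompts attend to base prompts while keeping base prompts unaffected by novel ones; again, the names and shapes below are illustrative assumptions rather than that paper's implementation.

```python
import torch


def unidirectional_prompt_mask(num_base, num_novel):
    """Boolean attention mask over [base | novel] prompt tokens (True = blocked).

    Base prompts attend only to base prompts, so representations learned from
    abundant base-class data are not disturbed by few-shot novel prompts;
    novel prompts may attend to both base and novel prompts.
    """
    n = num_base + num_novel
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_base, num_base:] = True        # block base -> novel attention
    return mask


# Usable as attn_mask in torch.nn.MultiheadAttention, where True entries mark
# positions that are not allowed to attend.
print(unidirectional_prompt_mask(3, 2))
```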