Texts as Images in Prompt Tuning for Multi-Label Image Recognition
- URL: http://arxiv.org/abs/2211.12739v1
- Date: Wed, 23 Nov 2022 07:00:11 GMT
- Title: Texts as Images in Prompt Tuning for Multi-Label Image Recognition
- Authors: Zixian Guo, Bowen Dong, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng
Zuo
- Abstract summary: We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
- Score: 70.9310322461598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt tuning has been employed as an efficient way to adapt large
vision-language pre-trained models (e.g. CLIP) to various downstream tasks in
data-limited or label-limited settings. Nonetheless, visual data (e.g., images)
is by default a prerequisite for learning prompts in existing methods. In this
work, we advocate that the effectiveness of image-text contrastive learning in
aligning the two modalities (for training CLIP) further makes it feasible to
treat texts as images for prompt tuning and introduce TaI prompting. In
contrast to the visual data, text descriptions are easy to collect, and their
class labels can be directly derived. Particularly, we apply TaI prompting to
multi-label image recognition, where sentences in the wild serve as
alternatives to images for prompt tuning. Moreover, with TaI, double-grained
prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and
fine-grained embeddings for enhancing the multi-label recognition performance.
Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP
by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE,
while it can be combined with existing methods of prompting from images to
improve recognition performance further. Code is released at
https://github.com/guozix/TaI-DPT.
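The core idea of TaI prompting is that, because CLIP aligns text and image features in a shared embedding space, caption embeddings can stand in for image embeddings when tuning class prompts. The following is a minimal illustrative sketch of that scoring step using random vectors in place of real CLIP features; the array names, dimensions, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in TaI prompting these would come from CLIP's
# text encoder, which shares an embedding space with the image encoder.
num_captions, num_classes, dim = 5, 3, 8
caption_embeds = rng.normal(size=(num_captions, dim))   # "texts as images"
class_prompts = rng.normal(size=(num_classes, dim))     # learnable class prompts

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def multilabel_scores(features, prompts, temperature=0.01):
    """Cosine similarity between each sample and each class prompt,
    temperature-scaled as in CLIP; a sigmoid gives independent
    per-class scores suitable for multi-label recognition."""
    sims = l2_normalize(features) @ l2_normalize(prompts).T
    return 1.0 / (1.0 + np.exp(-sims / temperature))

scores = multilabel_scores(caption_embeds, class_prompts)
print(scores.shape)  # -> (5, 3): one score per (caption, class) pair
```

At inference time, the same prompts would be scored against image embeddings instead of caption embeddings, which is what makes the text-only tuning transferable.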
Related papers
- Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP [22.33658954569737]
We build a mutual guidance mechanism, that introduces an Image-Guided-Text (IGT) component and a Text-Guided-Image (TGI) component.
Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method.
We propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% with approximately 100 times less time cost.
arXiv Detail & Related papers (2024-12-16T02:03:45Z)
- TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt [15.259819430801402]
We propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem.
Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge by the well-aligned space of pre-trained vision-language models.
Experimental results on VOC2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method can surpass state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2024-05-11T06:11:42Z)
- VIXEN: Visual Text Comparison Network for Image Difference Captioning [58.16313862434814]
We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images.
Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model.
arXiv Detail & Related papers (2024-02-29T12:56:18Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels [28.42405456691034]
We propose a two-stage strategy to facilitate a better visual representation in image re-identification tasks.
The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID.
The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks.
arXiv Detail & Related papers (2022-11-25T09:41:57Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
- DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [61.41339201200135]
We propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR.
Since DualCoOp only introduces a very light learnable overhead upon the pretrained vision-language framework, it can quickly adapt to multi-label recognition tasks.
arXiv Detail & Related papers (2022-06-20T02:36:54Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.