TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt
- URL: http://arxiv.org/abs/2405.06926v1
- Date: Sat, 11 May 2024 06:11:42 GMT
- Title: TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt
- Authors: Xiangyu Wu, Qing-Yuan Jiang, Yang Yang, Yi-Feng Wu, Qing-Guo Chen, Jianfeng Lu
- Abstract summary: We propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning, addressing the limitation that text-only prompt tuning fails to capture diverse visual knowledge.
Specifically, we first learn a pseudo-visual prompt for each category, mining diverse visual knowledge through the well-aligned embedding space of pre-trained vision-language models.
Experimental results on the VOC2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method can surpass state-of-the-art (SOTA) methods.
- Score: 15.259819430801402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent introduction of prompt tuning based on pre-trained vision-language models has dramatically improved the performance of multi-label image classification. However, existing strategies still have drawbacks: they either exploit massive labeled visual data at a high cost, or use text data only for text prompt tuning and thus fail to learn the diversity of visual knowledge. Hence, the application scenarios of these methods are limited. In this paper, we propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem. Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge through the well-aligned embedding space of pre-trained vision-language models. Then, a co-learning strategy with a dual-adapter module is designed to transfer visual knowledge from the pseudo-visual prompt to the text prompt, enhancing their visual representation abilities. Experimental results on the VOC2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method surpasses state-of-the-art (SOTA) methods across various settings for multi-label image classification tasks. The code is available at https://github.com/njustkmg/PVP.
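The abstract describes the architecture only at a high level. Below is a minimal, PyTorch-style sketch of how per-category pseudo-visual prompts and a dual-adapter co-learning objective could fit together; the class names, dimensions, adapter shapes, and loss weighting are illustrative assumptions rather than the authors' implementation (the released code at https://github.com/njustkmg/PVP is the reference).

```python
# Hypothetical sketch of PVP co-learning; all names and hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PseudoVisualPrompt(nn.Module):
    """One learnable prompt vector per category in the shared CLIP-style embedding space."""
    def __init__(self, num_classes: int, dim: int = 512):
        super().__init__()
        self.prompts = nn.Parameter(0.02 * torch.randn(num_classes, dim))

    def forward(self) -> torch.Tensor:
        return F.normalize(self.prompts, dim=-1)          # (C, D)


class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection (assumed form of the dual-adapter halves)."""
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(x + self.net(x), dim=-1)


def co_learning_loss(sentence_feat, text_prompt_emb, pvp_emb,
                     text_adapter, visual_adapter, labels,
                     tau: float = 0.07, alpha: float = 1.0):
    """Multi-label classification on text-only training data plus a transfer term
    that pulls the text prompts toward the pseudo-visual prompts (assumed form).

    sentence_feat:   (B, D) frozen text-encoder features of training sentences
    text_prompt_emb: (C, D) per-class text prompt embeddings
    pvp_emb:         (C, D) per-class pseudo-visual prompt embeddings
    labels:          (B, C) multi-hot targets
    """
    t = text_adapter(text_prompt_emb)                     # (C, D)
    v = visual_adapter(pvp_emb)                           # (C, D)
    logits = sentence_feat @ t.t() / tau                  # (B, C) class scores
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    transfer_loss = (1.0 - (t * v).sum(dim=-1)).mean()    # cosine alignment of the two prompt sets
    return cls_loss + alpha * transfer_loss


# Toy usage with random stand-ins for frozen-encoder outputs:
C, D, B = 20, 512, 8
pvp, t_ad, v_ad = PseudoVisualPrompt(C, D), Adapter(D), Adapter(D)
sent = F.normalize(torch.randn(B, D), dim=-1)
prompts = F.normalize(torch.randn(C, D), dim=-1)          # stand-in for text prompt embeddings
labels = torch.randint(0, 2, (B, C))
print(co_learning_loss(sent, prompts, pvp(), t_ad, v_ad, labels))
```

In this reading, the frozen vision-language text encoder would supply both the sentence features and the per-class text prompt embeddings, the pseudo-visual prompts and adapters would be the only trainable parameters, and at inference the tuned text prompts would score image-encoder features instead of sentence features.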
Related papers
- The Solution for Language-Enhanced Image New Category Discovery [5.500122875523184]
We propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts.
These prompts are defined for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models.
We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity.
arXiv Detail & Related papers (2024-07-06T08:09:29Z) - ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization [0.0]
We propose a two-stage training method to enhance visual performance and use contrastive learning to mine challenging samples.
We validate the effectiveness of the proposed strategy on several large-scale visual geo-localization datasets.
arXiv Detail & Related papers (2024-06-04T02:28:51Z) - VIXEN: Visual Text Comparison Network for Image Difference Captioning [58.16313862434814]
We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images.
Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model.
arXiv Detail & Related papers (2024-02-29T12:56:18Z) - Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well, improving few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
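As a rough illustration of the text-as-image idea above, the snippet below builds multi-hot targets for a caption from the class names it mentions, so that the caption's embedding can stand in for an image during prompt tuning; the keyword matching and label set are simplified assumptions, not TaI-DPT's actual label-mining pipeline.

```python
# Simplified text-as-image target construction (assumed, not TaI-DPT's exact pipeline).
import re
import torch

CLASS_NAMES = ["dog", "cat", "bicycle", "person"]   # illustrative label set

def caption_to_multihot(caption: str, class_names=CLASS_NAMES) -> torch.Tensor:
    """Mark a class as present if its name occurs in the caption."""
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    return torch.tensor([1.0 if name in tokens else 0.0 for name in class_names])

# A caption such as the one below then acts as a pseudo image labeled {dog, bicycle, person},
# and its text-encoder embedding is scored against the learnable class prompts
# exactly as an image embedding would be.
caption = "a person walking a dog past a parked bicycle"
print(caption_to_multihot(caption))   # tensor([1., 0., 1., 1.])
```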
arXiv Detail & Related papers (2022-11-23T07:00:11Z) - Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z) - Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning [48.583388368897126]
Few-shot learning is the task of learning to recognize previously unseen categories of images.
We propose a method that takes into account the names of the image classes.
arXiv Detail & Related papers (2021-05-21T08:08:28Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
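The dual-encoder alignment described in this last entry reduces to a symmetric contrastive (InfoNCE-style) loss over a batch of matched image-text pairs; here is a minimal sketch, with the temperature and batch size chosen arbitrarily.

```python
# Symmetric contrastive loss for a dual encoder (sketch; temperature is an assumption).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07):
    """image_emb, text_emb: (B, D) L2-normalized embeddings of matched image-text pairs."""
    logits = image_emb @ text_emb.t() / tau          # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match each image to its own text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # and each text to its own image
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random, already-normalized embeddings:
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(contrastive_loss(img, txt))
```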