PVLR: Prompt-driven Visual-Linguistic Representation Learning for
Multi-Label Image Recognition
- URL: http://arxiv.org/abs/2401.17881v1
- Date: Wed, 31 Jan 2024 14:39:11 GMT
- Title: PVLR: Prompt-driven Visual-Linguistic Representation Learning for
Multi-Label Image Recognition
- Authors: Hao Tan, Zichang Tan, Jun Li, Jun Wan, Zhen Lei
- Abstract summary: We propose a Prompt-driven Visual-Linguistic Representation Learning framework to better leverage the capabilities of the linguistic modality.
In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features.
- Score: 47.11517266162346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-label image recognition is a fundamental task in computer vision.
Recently, vision-language models have made notable advancements in this area.
However, previous methods often failed to effectively leverage the rich
knowledge within language models and instead incorporated label semantics into
visual features in a unidirectional manner. In this paper, we propose a
Prompt-driven Visual-Linguistic Representation Learning (PVLR) framework to
better leverage the capabilities of the linguistic modality. In PVLR, we first
introduce a dual-prompting strategy comprising Knowledge-Aware Prompting (KAP)
and Context-Aware Prompting (CAP). KAP utilizes fixed prompts to capture the
intrinsic semantic knowledge and relationships across all labels, while CAP
employs learnable prompts to capture context-aware label semantics and
relationships. Later, we propose an Interaction and Fusion Module (IFM) to
interact and fuse the representations obtained from KAP and CAP. In contrast to
the unidirectional fusion in previous works, we introduce a Dual-Modal
Attention (DMA) that enables bidirectional interaction between textual and
visual features, yielding context-aware label representations and
semantic-related visual representations, which are subsequently used to
calculate similarities and generate final predictions for all labels. Extensive
experiments on three popular datasets including MS-COCO, Pascal VOC 2007, and
NUS-WIDE demonstrate the superiority of PVLR.
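
The abstract describes the pipeline (KAP/CAP prompting, bidirectional Dual-Modal Attention, similarity-based prediction) only in prose. Below is a minimal PyTorch sketch of the bidirectional interaction and label-visual similarity step; the module name DualModalAttentionSketch, the use of nn.MultiheadAttention, the mean pooling of visual tokens, the feature dimension, and the logit scale are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualModalAttentionSketch(nn.Module):
    """Bidirectional text-visual cross-attention followed by per-label similarity scoring."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # One attention block per direction: labels attend to patches, patches attend to labels.
        self.text_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.logit_scale = nn.Parameter(torch.tensor(4.0))  # assumed learnable temperature

    def forward(self, label_emb: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # label_emb:  (B, C, D) label/text representations, e.g. obtained from the prompts
        # vis_tokens: (B, N, D) patch-level visual features from the image encoder
        ctx_labels, _ = self.text_to_vis(label_emb, vis_tokens, vis_tokens)  # context-aware labels
        sem_visual, _ = self.vis_to_text(vis_tokens, label_emb, label_emb)   # semantic-related visuals

        lbl = F.normalize(ctx_labels, dim=-1)
        vis = F.normalize(sem_visual.mean(dim=1, keepdim=True), dim=-1)      # pooled visual feature
        return self.logit_scale.exp() * (lbl * vis).sum(dim=-1)              # (B, C) per-label logits


if __name__ == "__main__":
    dma = DualModalAttentionSketch()
    labels = torch.randn(2, 80, 512)    # e.g. the 80 MS-COCO categories
    patches = torch.randn(2, 196, 512)  # e.g. 14 x 14 ViT patch tokens
    print(dma(labels, patches).shape)   # torch.Size([2, 80])
```

In such a design, the text-to-visual branch yields the context-aware label representations and the visual-to-text branch yields the semantic-related visual representations named in the abstract, with their similarities producing the per-label predictions.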
Related papers
- SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to draw on the inherent knowledge of LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt [15.259819430801402]
We propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem.
Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge by the well-aligned space of pre-trained vision-language models.
Experimental results on the VOC 2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method surpasses state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2024-05-11T06:11:42Z) - Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses the knowledge of a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM) such as CLIP to multi-label classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition (a rough CLIP-based zero-shot scoring sketch in this spirit appears after the list below).
arXiv Detail & Related papers (2024-03-02T13:43:32Z) - Exploring Part-Informed Visual-Language Learning for Person
Re-Identification [40.725052076983516]
We propose to enhance fine-grained visual features with part-informed language supervision for visual-based person re-identification tasks.
Our $\pi$-VL achieves substantial improvements over previous state-of-the-art methods on four commonly used ReID benchmarks.
arXiv Detail & Related papers (2023-08-04T23:13:49Z) - DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition
with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a challenging task of great practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
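
Several of the related works above (DualCoOp++, TAI++, and the data-free LLM-powered prompt-tuning framework) build on CLIP's image-text alignment by scoring class-specific prompts against the image. As a rough illustration of that shared starting point, the sketch below performs zero-shot multi-label scoring with the off-the-shelf OpenAI CLIP model; the prompt template, class list, image path, and score threshold are assumptions and do not reproduce any specific method listed here.

```python
# Rough sketch: zero-shot multi-label scoring with off-the-shelf CLIP and
# hand-written per-class prompts. Template, classes, image path, and threshold
# are illustrative assumptions.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["person", "dog", "bicycle", "car"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image path

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # Independent per-label cosine similarities (multi-label), not a softmax over classes.
    scores = (img_feat @ txt_feat.T).squeeze(0)

predicted = [c for c, s in zip(classes, scores.tolist()) if s > 0.25]  # assumed threshold
print(predicted)
```

The prompt-tuning methods above replace the hand-written template with learned (textual or pseudo-visual) prompts and train them with multi-label objectives, but the prompt-versus-image similarity scoring shown here remains the common backbone.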
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.