CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly
Supervised Text-based Person Re-Identification
- URL: http://arxiv.org/abs/2401.10011v1
- Date: Thu, 18 Jan 2024 14:27:01 GMT
- Authors: Yanwei Zheng, Xinpeng Zhao, Chuanlin Lan, Xiaowei Zhang, Bowen Huang,
Jibin Yang, Dongxiao Yu
- Abstract summary: Weakly supervised text-based person re-identification
(TPRe-ID) seeks to retrieve images of a target person using textual
descriptions. The primary challenge lies in intra-class differences, which
encompass intra-modal feature variations and cross-modal semantic gaps. CPCL
introduces the CLIP model to weakly supervised TPRe-ID for the first time,
mapping visual and textual instances into a shared latent space.
- Score: 10.64115914599574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised text-based person re-identification (TPRe-ID)
seeks to retrieve images of a target person using textual descriptions
without relying on identity annotations, which makes the task more
challenging and more practical. The primary challenge lies in intra-class
differences, which encompass intra-modal feature variations and cross-modal
semantic gaps. Prior works have focused on instance-level samples and
ignored the prototypical features of each person, which are intrinsic and
invariant. To address this, we propose a Cross-Modal Prototypical
Contrastive Learning (CPCL) method. CPCL introduces the CLIP model to
weakly supervised TPRe-ID for the first time, mapping visual and textual
instances into a shared latent space. The proposed Prototypical Multi-modal
Memory (PMM) module then captures associations between the heterogeneous
modalities of image-text pairs belonging to the same person through the
Hybrid Cross-modal Matching (HCM) module in a many-to-many mapping fashion.
Moreover, the Outlier Pseudo Label Mining (OPLM) module distinguishes
valuable outlier samples in each modality, mining implicit relationships
between image-text pairs to create more reliable clusters. Experimental
results demonstrate that CPCL attains state-of-the-art performance on all
three public datasets, improving Rank@1 accuracy by 11.58%, 8.77%, and
5.25% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
The code is available at https://github.com/codeGallery24/CPCL.
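
To make the idea concrete, below is a minimal sketch of a cross-modal
prototypical contrastive objective of the kind the abstract describes:
CLIP-style image and text features are grouped by clustering-derived
pseudo-labels, a per-identity prototype is kept for each modality (playing
the role of a prototypical memory), and each instance is contrasted against
the prototypes of the opposite modality. The function names, temperature
value, and symmetric two-term loss are illustrative assumptions, not the
paper's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def build_prototypes(feats, pseudo_labels):
    """Average L2-normalized features per pseudo-identity cluster.

    feats: (N, D) embeddings from one modality.
    pseudo_labels: (N,) ints, assumed contiguous in 0..K-1 so that the
    prototype of cluster k ends up in row k.
    Returns a (K, D) matrix of L2-normalized prototypes."""
    feats = F.normalize(feats, dim=1)
    labels = pseudo_labels.unique()  # sorted, i.e. 0..K-1
    protos = torch.stack([feats[pseudo_labels == k].mean(dim=0) for k in labels])
    return F.normalize(protos, dim=1)

def prototypical_contrastive_loss(query, pseudo_labels, prototypes, tau=0.07):
    """InfoNCE-style loss pulling each query toward the prototype of its own
    pseudo-identity and pushing it away from all other prototypes.

    query: (N, D) embeddings from one modality (e.g. text).
    prototypes: (K, D) prototypes from the *other* modality (e.g. image),
    which is what makes the objective cross-modal."""
    query = F.normalize(query, dim=1)
    logits = query @ prototypes.t() / tau          # (N, K) cosine similarities
    return F.cross_entropy(logits, pseudo_labels)  # target: own cluster index

# Usage: given CLIP features img_f, txt_f of shape (N, D) and shared
# pseudo_labels of shape (N,), a symmetric cross-modal loss could be:
#   img_protos = build_prototypes(img_f, pseudo_labels)
#   txt_protos = build_prototypes(txt_f, pseudo_labels)
#   loss = prototypical_contrastive_loss(txt_f, pseudo_labels, img_protos) \
#        + prototypical_contrastive_loss(img_f, pseudo_labels, txt_protos)
```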
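The abstract does not spell out how OPLM decides which outliers are
"valuable". The following is a hedged sketch of one plausible reading, in
which clustering outliers (label -1, as produced by DBSCAN-style
clustering) are re-admitted to a cluster only when their similarity to that
cluster's cross-modal prototype clears a threshold; the threshold and the
reassignment rule are assumptions for illustration, not the paper's stated
criterion.

```python
import torch
import torch.nn.functional as F

def mine_outliers(feats, pseudo_labels, cross_protos, sim_thresh=0.5):
    """Re-admit clustering outliers (label -1) using cross-modal evidence.

    feats: (N, D) embeddings of one modality.
    pseudo_labels: (N,) ints with -1 marking outliers.
    cross_protos: (K, D) L2-normalized prototypes from the other modality.
    An outlier joins cluster k only if its cosine similarity to that
    cluster's cross-modal prototype exceeds sim_thresh; otherwise it stays
    excluded from the clusters (illustrative rule only)."""
    labels = pseudo_labels.clone()
    out_idx = (labels == -1).nonzero(as_tuple=True)[0]
    if out_idx.numel() == 0:
        return labels
    sims = F.normalize(feats[out_idx], dim=1) @ cross_protos.t()  # (M, K)
    best_sim, best_k = sims.max(dim=1)
    keep = best_sim > sim_thresh
    labels[out_idx[keep]] = best_k[keep]
    return labels
```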
Related papers
- CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training [17.27516384073838]
We propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning.
CMAL achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks.
arXiv Detail & Related papers (2024-10-16T14:12:26Z) - Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification [30.983346937558743]
Key challenges in USL-VI-ReID are to effectively generate pseudo-labels and establish pseudo-label correspondences.
We propose a Multi-Memory Matching framework for USL-VI-ReID.
Experiments on the public SYSU-MM01 and RegDB datasets demonstrate the reliability of the established cross-modality correspondences.
arXiv Detail & Related papers (2024-01-12T01:24:04Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of the tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of the textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID [56.573905143954015]
We propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters.
Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at the cluster level.
Experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:27:46Z) - End-to-End Context-Aided Unicity Matching for Person Re-identification [100.02321122258638]
We propose an end-to-end person unicity matching architecture for learning and refining the person matching relations.
We use the samples' global context relationship to refine the soft matching results and reach the matching unicity through bipartite graph matching.
With full consideration of real-world person re-identification applications, we achieve unicity matching in both one-shot and multi-shot settings.
arXiv Detail & Related papers (2022-10-20T07:33:57Z) - InsCon: Instance Consistency Feature Representation via Self-Supervised
Learning [9.416267640069297]
We propose a new end-to-end self-supervised framework called InsCon, which is devoted to capturing multi-instance information.
InsCon builds a targeted learning paradigm that takes multi-instance images as input, aligning the learned features between corresponding instance views.
In addition, InsCon introduces cell-instance pulling and pushing, which utilizes cell consistency to enhance fine-grained feature representation.
arXiv Detail & Related papers (2022-03-15T07:09:00Z) - CoADNet: Collaborative Aggregation-and-Distribution Networks for
Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that best match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.