CLIP-Driven Semantic Discovery Network for Visible-Infrared Person
Re-Identification
- URL: http://arxiv.org/abs/2401.05806v2
- Date: Fri, 12 Jan 2024 11:14:09 GMT
- Title: CLIP-Driven Semantic Discovery Network for Visible-Infrared Person
Re-Identification
- Authors: Xiaoyan Yu, Neng Dong, Liehuang Zhu, Hao Peng, Dapeng Tao
- Abstract summary: Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of a Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
- Score: 39.262536758248245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visible-infrared person re-identification (VIReID) primarily deals with
matching identities across person images from different modalities. Due to the
modality gap between visible and infrared images, cross-modality identity
matching poses significant challenges. Recognizing that high-level semantics of
pedestrian appearance, such as gender, shape, and clothing style, remain
consistent across modalities, this paper intends to bridge the modality gap by
infusing visual features with high-level semantics. Given the capability of
CLIP to sense high-level semantic information corresponding to visual
representations, we explore the application of CLIP within the domain of
VIReID. Consequently, we propose a CLIP-Driven Semantic Discovery Network
(CSDN) that consists of Modality-specific Prompt Learner, Semantic Information
Integration (SII), and High-level Semantic Embedding (HSE). Specifically,
considering the diversity stemming from modality discrepancies in language
descriptions, we devise bimodal learnable text tokens to capture
modality-private semantic information for visible and infrared images,
respectively. Additionally, acknowledging the complementary nature of semantic
details across different modalities, we integrate text features from the
bimodal language descriptions to achieve comprehensive semantics. Finally, we
establish a connection between the integrated text features and the visual
features across modalities. This process embeds rich high-level semantic
information into visual representations, thereby promoting the modality
invariance of visual representations. The effectiveness and superiority of our
proposed CSDN over existing methods have been substantiated through
experimental evaluations on multiple widely used benchmarks. The code will be
released at https://github.com/nengdong96/CSDN.
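To make the three components concrete, the following is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: bimodal learnable text tokens (Modality-specific Prompt Learner), fusion of the two text features (Semantic Information Integration), and an alignment term that pulls visible and infrared visual features toward the integrated text feature (High-level Semantic Embedding). The encoder stand-ins, the concatenation-based fusion, and the cosine alignment loss are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySpecificPromptLearner(nn.Module):
    """Bimodal learnable text tokens: one set for visible, one for infrared."""

    def __init__(self, num_tokens: int = 8, token_dim: int = 512):
        super().__init__()
        self.visible_tokens = nn.Parameter(0.02 * torch.randn(num_tokens, token_dim))
        self.infrared_tokens = nn.Parameter(0.02 * torch.randn(num_tokens, token_dim))

    def forward(self):
        return self.visible_tokens, self.infrared_tokens


class CSDNSketch(nn.Module):
    """Toy stand-in for the CSDN pipeline described in the abstract."""

    def __init__(self, img_feat_dim: int = 2048, emb_dim: int = 512, num_tokens: int = 8):
        super().__init__()
        self.prompt_learner = ModalitySpecificPromptLearner(num_tokens, emb_dim)
        # Stand-ins for CLIP's text and image encoders (the real model would
        # reuse the pretrained CLIP encoders).
        self.text_encoder = nn.Linear(emb_dim, emb_dim)
        self.visible_encoder = nn.Linear(img_feat_dim, emb_dim)
        self.infrared_encoder = nn.Linear(img_feat_dim, emb_dim)
        # Semantic Information Integration (SII): fuse the two text features.
        # Concatenation + projection is an assumption, not the paper's exact operator.
        self.sii = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, vis_feats, ir_feats):
        vis_tokens, ir_tokens = self.prompt_learner()
        # Modality-private semantics from the learnable prompts (mean-pooled here).
        t_vis = self.text_encoder(vis_tokens.mean(dim=0))
        t_ir = self.text_encoder(ir_tokens.mean(dim=0))
        t_fused = self.sii(torch.cat([t_vis, t_ir], dim=-1))

        v_vis = self.visible_encoder(vis_feats)   # (B, emb_dim)
        v_ir = self.infrared_encoder(ir_feats)    # (B, emb_dim)

        # High-level Semantic Embedding (HSE): pull visual features of both
        # modalities toward the shared, integrated text feature.
        hse_loss = (1 - F.cosine_similarity(v_vis, t_fused.unsqueeze(0), dim=-1)).mean() \
                 + (1 - F.cosine_similarity(v_ir, t_fused.unsqueeze(0), dim=-1)).mean()
        return v_vis, v_ir, hse_loss


# Example usage with random pooled backbone features, batch of 4 images per modality.
model = CSDNSketch()
vis = torch.randn(4, 2048)
ir = torch.randn(4, 2048)
_, _, loss = model(vis, ir)
loss.backward()
```

In practice the identity (re-identification) losses would be applied to the visual features alongside the alignment term; the sketch only isolates the semantic-embedding idea.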
Related papers
- Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities.
Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics.
We propose an Embedding and Enriching Explicit Semantics framework to learn semantically rich cross-modality pedestrian representations.
arXiv Detail & Related papers (2024-12-11T14:27:30Z)
- See What You Seek: Semantic Contextual Integration for Cloth-Changing Person Re-Identification [16.845045499676793]
Cloth-changing person re-identification (CC-ReID) aims to match individuals across multiple surveillance cameras despite variations in clothing.
Existing methods typically focus on mitigating the effects of clothing changes or enhancing ID-relevant features.
We propose a novel prompt learning framework, Semantic Contextual Integration (SCI), for CC-ReID.
arXiv Detail & Related papers (2024-12-02T10:11:16Z)
- SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the high memory cost of embedding dense visual data during end-to-end training.
We propose a novel sign language representation framework called the Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z)
- Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which endows the visual prompt with semantic information to distill a semantic-enhanced prompt for visual representation enrichment.
AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification [90.39454748065558]
Body shape is one of the significant modality-shared cues for VI-ReID.
We propose a shape-erased feature learning paradigm that decorrelates modality-shared features in two subspaces.
Experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-04-09T10:22:10Z)
- Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features [11.48760300147023]
This paper introduces a novel method called the Multi-modAl Text Recognition Network (MATRN).
MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features.
Our experiments demonstrate that MATRN achieves state-of-the-art performances on seven benchmarks with large margins.
arXiv Detail & Related papers (2021-11-30T10:22:11Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID [20.700750237972155]
Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems.
The key challenge is to align inter-modality representations according to the semantic information present for a person while ignoring background information.
We present AXM-Net, a novel CNN-based architecture designed for learning semantically aligned visual and textual representations.
arXiv Detail & Related papers (2021-01-19T16:06:39Z)
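For intuition, a generic bidirectional cross-attention block of the kind such alignment methods build on might look like the sketch below. This is an illustrative assumption in the spirit of cross-modal context sharing, not AXM-Net's actual architecture (which the abstract describes as CNN-based); the dimensions and layer choices are hypothetical.

```python
import torch
import torch.nn as nn


class CrossModalAlign(nn.Module):
    """Generic bidirectional cross-attention between visual and textual tokens."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.v_from_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, dim) region/patch features; txt_tokens: (B, Nt, dim) word features.
        v_ctx, _ = self.v_from_t(vis_tokens, txt_tokens, txt_tokens)  # visual queries attend to text
        t_ctx, _ = self.t_from_v(txt_tokens, vis_tokens, vis_tokens)  # text queries attend to visual
        return self.norm_v(vis_tokens + v_ctx), self.norm_t(txt_tokens + t_ctx)


# Example: align 49 visual tokens with 12 word tokens for a batch of 2.
block = CrossModalAlign()
v, t = block(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
```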