CLIP-Driven Semantic Discovery Network for Visible-Infrared Person
Re-Identification
- URL: http://arxiv.org/abs/2401.05806v2
- Date: Fri, 12 Jan 2024 11:14:09 GMT
- Title: CLIP-Driven Semantic Discovery Network for Visible-Infrared Person
Re-Identification
- Authors: Xiaoyan Yu, Neng Dong, Liehuang Zhu, Hao Peng, Dapeng Tao
- Abstract summary: Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of a Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
- Score: 39.262536758248245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visible-infrared person re-identification (VIReID) primarily deals with
matching identities across person images from different modalities. Due to the
modality gap between visible and infrared images, cross-modality identity
matching poses significant challenges. Recognizing that high-level semantics of
pedestrian appearance, such as gender, shape, and clothing style, remain
consistent across modalities, this paper intends to bridge the modality gap by
infusing visual features with high-level semantics. Given the capability of
CLIP to sense high-level semantic information corresponding to visual
representations, we explore the application of CLIP within the domain of
VIReID. Consequently, we propose a CLIP-Driven Semantic Discovery Network
(CSDN) that consists of a Modality-specific Prompt Learner, Semantic Information
Integration (SII), and High-level Semantic Embedding (HSE). Specifically,
considering the diversity stemming from modality discrepancies in language
descriptions, we devise bimodal learnable text tokens to capture
modality-private semantic information for visible and infrared images,
respectively. Additionally, acknowledging the complementary nature of semantic
details across different modalities, we integrate text features from the
bimodal language descriptions to achieve comprehensive semantics. Finally, we
establish a connection between the integrated text features and the visual
features across modalities. This process embeds rich high-level semantic
information into visual representations, thereby promoting the modality
invariance of visual representations. The effectiveness and superiority of our
proposed CSDN over existing methods have been substantiated through
experimental evaluations on multiple widely used benchmarks. The code will be
released at \url{https://github.com/nengdong96/CSDN}.
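As a rough illustration of the pipeline described in the abstract, below is a minimal PyTorch-style sketch. It is not the authors' released implementation: the encoder stubs stand in for frozen CLIP text/image encoders, and names such as CSDNSketch, sii, and hse_loss are illustrative assumptions.

import torch
import torch.nn as nn


class CSDNSketch(nn.Module):
    def __init__(self, dim=512, n_ctx=4):
        super().__init__()
        # Modality-specific Prompt Learner: separate learnable text tokens
        # for visible (RGB) and infrared descriptions.
        self.ctx_vis = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.ctx_ir = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Stand-ins for the frozen CLIP text/image encoders (assumption).
        self.text_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(n_ctx * dim, dim))
        self.image_encoder = nn.Linear(dim, dim)
        # Semantic Information Integration (SII): fuse the bimodal text features.
        self.sii = nn.Linear(2 * dim, dim)

    def forward(self, x_vis, x_ir):
        b = x_vis.size(0)
        # Text features produced from the bimodal learnable prompts.
        t_vis = self.text_encoder(self.ctx_vis.unsqueeze(0).expand(b, -1, -1))
        t_ir = self.text_encoder(self.ctx_ir.unsqueeze(0).expand(b, -1, -1))
        t_fused = self.sii(torch.cat([t_vis, t_ir], dim=-1))  # integrated semantics
        # Visual features for both modalities (pre-extracted features assumed).
        v_vis = self.image_encoder(x_vis)
        v_ir = self.image_encoder(x_ir)
        return t_fused, v_vis, v_ir


def hse_loss(t_fused, v_vis, v_ir):
    # High-level Semantic Embedding (HSE): pull the visual features of both
    # modalities toward the integrated text semantics. Cosine alignment is an
    # assumption; the paper's exact objective may differ.
    cos = nn.functional.cosine_similarity
    return (1 - cos(v_vis, t_fused)).mean() + (1 - cos(v_ir, t_fused)).mean()

In the actual method, the visual features would come from a CLIP image encoder applied to visible and infrared images, and this alignment term would be trained jointly with the usual re-identification losses; the cosine term above is only a placeholder for that idea.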
Related papers
- Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval [40.83470534691711]
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries.
One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs.
We propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations.
arXiv Detail & Related papers (2024-09-30T05:25:51Z) - SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z) - Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z) - Shape-Erased Feature Learning for Visible-Infrared Person
Re-Identification [90.39454748065558]
Body shape is one of the significant modality-shared cues for VI-ReID.
We propose a shape-erased feature learning paradigm that decorrelates modality-shared features in two subspaces.
Experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-04-09T10:22:10Z) - Multi-modal Text Recognition Networks: Interactive Enhancements between
Visual and Semantic Features [11.48760300147023]
This paper introduces a novel method called Multi-modal Text Recognition Network (MATRN).
MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features.
Our experiments demonstrate that MATRN achieves state-of-the-art performance on seven benchmarks by large margins.
arXiv Detail & Related papers (2021-11-30T10:22:11Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID [20.700750237972155]
Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems.
The key challenge is to align inter-modality representations according to the semantic information present for a person while ignoring background information.
We present AXM-Net, a novel CNN-based architecture designed for learning semantically aligned visual and textual representations.
arXiv Detail & Related papers (2021-01-19T16:06:39Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)