ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
- URL: http://arxiv.org/abs/2504.18025v1
- Date: Fri, 25 Apr 2025 02:37:47 GMT
- Title: ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
- Authors: Shuanglin Yan, Neng Dong, Shuang Li, Rui Yan, Hao Tang, Jing Qin,
- Abstract summary: Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images. Existing methods rely solely on identity label supervision. Vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling.
- Score: 34.82553240281019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.
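To make the alignment idea in the abstract concrete, below is a minimal PyTorch sketch of a contrastive text-visual consistency loss in the spirit of the described TVCR: visual body-shape features are pulled toward text embeddings of the corresponding shape descriptions. The function name `tvcr_loss`, the tensor shapes, the temperature value, and the symmetric cross-entropy formulation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a text-visual consistency objective in the spirit of
# the abstract's TVCR. Shapes, names, and the loss form are assumptions for
# illustration; the paper's actual formulation may differ.
import torch
import torch.nn.functional as F

def tvcr_loss(visual_shape_feats, shape_text_feats, temperature=0.07):
    """Symmetric contrastive consistency between visual and textual shape features.

    visual_shape_feats: (B, D) body-shape features from the visual branch (assumed).
    shape_text_feats:   (B, D) text embeddings of per-image shape descriptions,
                        e.g. from a CLIP-style text encoder (assumed precomputed).
    """
    v = F.normalize(visual_shape_feats, dim=-1)
    t = F.normalize(shape_text_feats, dim=-1)
    logits = v @ t.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)    # image i matches text i
    # symmetric cross-entropy over rows (image->text) and columns (text->image)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # random tensors stand in for real encoder outputs
    vis = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(tvcr_loss(vis, txt).item())
```

Under this sketch, each image in a batch is matched to its own shape description; in practice the text embeddings would come from a CLIP-style text encoder applied to parsing-derived descriptions, as the abstract outlines.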
Related papers
- Diverse Semantics-Guided Feature Alignment and Decoupling for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging task due to the large modality discrepancy between visible and infrared images.
We propose a novel Diverse Semantics-guided Feature Alignment and Decoupling (DSFAD) network to align identity-relevant features from different modalities into a textual embedding space.
arXiv Detail & Related papers (2025-05-01T15:55:38Z) - See What You Seek: Semantic Contextual Integration for Cloth-Changing Person Re-Identification [16.845045499676793]
Cloth-changing person re-identification (CC-ReID) aims to match individuals across multiple surveillance cameras despite variations in clothing. Existing methods typically focus on mitigating the effects of clothing changes or enhancing ID-relevant features. We propose a novel prompt learning framework, Semantic Contextual Integration (SCI), for CC-ReID.
arXiv Detail & Related papers (2024-12-02T10:11:16Z) - SHAPE-IT: Exploring Text-to-Shape-Display for Generative Shape-Changing Behaviors with LLMs [12.235304780960142]
This paper introduces text-to-shape-display, a novel approach to generating dynamic shape changes in pin-based shape displays through natural language commands.
By leveraging large language models (LLMs) and AI-chaining, our approach allows users to author shape-changing behaviors on demand through text prompts without programming.
arXiv Detail & Related papers (2024-09-10T04:18:49Z) - CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification [39.262536758248245]
Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
arXiv Detail & Related papers (2024-01-11T10:20:13Z) - Shape-centered Representation Learning for Visible-Infrared Person Re-identification [53.56628297970931]
Current Visible-Infrared Person Re-Identification (VI-ReID) methods prioritize extracting distinguishing appearance features.
We propose the Shape-centered Representation Learning framework (ScRL), which focuses on learning shape features and appearance features associated with shapes.
To acquire appearance features related to shape, we design the Appearance Feature Enhancement (AFE), which accentuates identity-related features while suppressing identity-unrelated features guided by shape features.
arXiv Detail & Related papers (2023-10-27T07:57:24Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification [90.39454748065558]
Body shape is one of the significant modality-shared cues for VI-ReID.
We propose a shape-erased feature learning paradigm that decorrelates modality-shared features in two subspaces.
Experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-04-09T10:22:10Z) - Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z) - SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization [66.35116147275568]
Self-supervised representation learning has drawn considerable attention from the scene text recognition community.
We tackle the issue by formulating the representation learning scheme in a generative manner.
We propose a Similarity-Aware Normalization (SimAN) module to identify the different patterns and align the corresponding styles from the guiding patch.
arXiv Detail & Related papers (2022-03-20T08:43:10Z)