ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification
- URL: http://arxiv.org/abs/2505.17821v1
- Date: Fri, 23 May 2025 12:38:27 GMT
- Title: ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification
- Authors: Shihao Li, Chenglong Li, Aihua Zheng, Jin Tang, Bin Luo
- Abstract summary: Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications. Most existing methods fuse spectral data through intricate modal interaction modules, lacking fine-grained semantic understanding of spectral information. We propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP.
- Score: 25.953780086825457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, complex modal differences between heterogeneous spectra make it difficult to efficiently exploit the complementary and discrepant information across spectra. Most existing methods fuse spectral data through intricate modal interaction modules and lack fine-grained semantic understanding of spectral information (\textit{e.g.}, text descriptions, part masks, and object keypoints). To solve this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP to unify different spectral visual features through text semantics. Specifically, we first propose online prompt learning, which uses a learnable text prompt as an identity-level semantic center to bridge the identity semantics of different spectra in an online manner. Then, in the absence of concrete text descriptions, we propose a multi-spectral identity-condition module that uses an identity prototype as the spectral identity condition to constrain prompt learning. Meanwhile, we construct an alignment loop that mutually optimizes the learnable text prompt and the spectral visual encoder, preventing online prompt learning from disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose a multi-spectral adapter that employs a low-rank adaptation method to learn spectra-specific features. Comprehensive experiments on five benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms state-of-the-art methods.
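The abstract describes two mechanisms that can be illustrated generically: a learnable text prompt acting as an identity-level semantic center that spectral visual features align to via CLIP-style cosine similarity, and a multi-spectral adapter following the standard low-rank adaptation (LoRA) recipe, where a frozen weight W is augmented with a trainable rank-r product B·A. The NumPy sketch below shows both mechanisms in miniature; all names, shapes, and initializations are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared text-image embedding dimension (illustrative)

# --- Learnable text prompt as identity-level semantic center ---
# One trainable prompt embedding per identity; visual features from any
# spectrum are compared to the prompts via CLIP-style cosine similarity.
num_ids = 3
prompts = rng.normal(size=(num_ids, d))   # trainable identity prompts
feats = rng.normal(size=(5, d))           # visual features from some spectrum

def cosine_logits(feats, prompts):
    """Cosine similarity between L2-normalized features and prompt centers."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    p = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    return f @ p.T                        # shape (num_feats, num_ids)

logits = cosine_logits(feats, prompts)

# --- Multi-spectral adapter via low-rank adaptation ---
# Frozen weight W plus a trainable rank-r update B @ A per spectrum,
# so each spectrum learns a cheap specific correction.
rank = 2
W = rng.normal(size=(d, d))                 # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d))  # trainable down-projection
B = np.zeros((d, rank))                     # trainable, zero-initialized

def adapted_forward(x):
    """Frozen layer plus low-rank spectrum-specific update."""
    return x @ (W + B @ A).T

out = adapted_forward(feats)
# With B zero-initialized, the adapter is a no-op at the start of training.
assert np.allclose(out, feats @ W.T)
```

Zero-initializing one factor of the low-rank pair is the usual LoRA convention: it guarantees the adapted model starts exactly at the pre-trained model, which matches the abstract's concern about not disrupting the pre-trained text-image alignment distribution.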
Related papers
- Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment [33.152772648399846]
We propose a novel approach to enrich semantic representations in vision-language contrastive learning. We leverage a pretrained LLM as the text encoder within the CLIP framework, processing all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features.
arXiv Detail & Related papers (2025-08-03T20:48:43Z) - MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques [91.26187560114381]
Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and contemporary deep learning approaches.
arXiv Detail & Related papers (2025-07-30T15:56:36Z) - SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection [4.013156524547072]
This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection. The proposed framework adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2025-07-27T09:16:39Z) - Multimodal Prompt Alignment for Facial Expression Recognition [24.470095812039286]
MPA-FER provides fine-grained semantic guidance to the learning process of prompted visual features. Our framework outperforms state-of-the-art methods on three FER benchmark datasets.
arXiv Detail & Related papers (2025-06-26T05:28:57Z) - Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification [47.40091830500585]
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. The proposed framework, video-level language-driven VVI-ReID (VLD), consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal aggregation.
arXiv Detail & Related papers (2025-06-03T04:49:08Z) - NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID [21.162847644106435]
We propose a reliable multi-modal caption generation method based on attribute confidence. We also propose a novel ReID framework, NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification.
arXiv Detail & Related papers (2025-05-26T13:52:28Z) - Diverse Semantics-Guided Feature Alignment and Decoupling for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging task due to the large modality discrepancy between visible and infrared images. We propose a novel Diverse Semantics-guided Feature Alignment and Decoupling (DSFAD) network to align identity-relevant features from different modalities into a textual embedding space.
arXiv Detail & Related papers (2025-05-01T15:55:38Z) - MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration [15.501904258858112]
Hyperspectral images (HSIs) often suffer from diverse and unknown degradations during imaging. Existing HSI restoration methods rely on specific degradation assumptions, limiting their effectiveness in complex scenarios. We propose MP-HSIR, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration.
arXiv Detail & Related papers (2025-03-12T07:40:49Z) - Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations. SSL features conflict with traditional spectral features like FBanks in terms of update directions. We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z) - ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task. It captures high-order correlations of forged images from diverse linguistic feature spaces. It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z) - CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification [39.262536758248245]
Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
arXiv Detail & Related papers (2024-01-11T10:20:13Z) - TOP-ReID: Multi-spectral Object Re-Identification with Token Permutation [64.65950381870742]
We propose a cyclic token permutation framework for multi-spectral object ReID, dubbed TOP-ReID.
We also propose a Token Permutation Module (TPM) for cyclic multi-spectral feature aggregation.
Our proposed framework can generate more discriminative multi-spectral features for robust object ReID.
arXiv Detail & Related papers (2023-12-15T08:54:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.