X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification
- URL: http://arxiv.org/abs/2511.17964v2
- Date: Tue, 25 Nov 2025 05:11:45 GMT
- Title: X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification
- Authors: Chenyang Yu, Xuehu Liu, Pingping Zhang, Huchuan Lu
- Abstract summary: We propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC) to align and integrate features from different modalities. Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment.
- Score: 79.37768038337971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language models (e.g., CLIP) have recently achieved remarkable performance in retrieval tasks, yet their potential for Video-based Visible-Infrared Person Re-Identification (VVI-ReID) remains largely unexplored. The primary challenges are narrowing the modality gap and leveraging spatiotemporal information in video sequences. To address the above issues, in this paper, we propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC) to align and integrate features from different modalities, guiding the network to reduce the modality discrepancy. Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment to enhance temporal modeling and further reduce modality gaps. Finally, by integrating multi-granularity information, a robust sequence-level representation is achieved. Extensive experiments on two large-scale VVI-ReID benchmarks (i.e., HITSZ-VCM and BUPTCampus) demonstrate the superiority of our method over state-of-the-art methods. The source code is released at https://github.com/AsuradaYuci/X-ReID.
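The abstract describes CPC only at a high level, so the following is a minimal, hypothetical sketch of what a cross-modality prototype collaboration step could look like: per-identity prototypes are maintained for each modality, fused, and used as targets that pull visible and infrared features together. The class and method names (`PrototypeBank`, `collaborative_loss`) and the exact loss form are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
import torch
import torch.nn.functional as F

class PrototypeBank:
    """Hypothetical per-identity prototype store, one bank per modality."""

    def __init__(self, num_ids, dim, momentum=0.9):
        self.vis = torch.zeros(num_ids, dim)  # visible-modality prototypes
        self.ir = torch.zeros(num_ids, dim)   # infrared-modality prototypes
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, labels, modality):
        # EMA-update the prototype of each identity seen in the batch.
        bank = self.vis if modality == 'vis' else self.ir
        for f, y in zip(feats, labels):
            bank[y] = self.m * bank[y] + (1 - self.m) * f
            bank[y] = F.normalize(bank[y], dim=0)

    def collaborative_loss(self, feats, labels, temp=0.07):
        # Fuse the two modality banks into joint prototypes and contrast
        # batch features against them, pulling both modalities toward a
        # shared, identity-specific target.
        protos = F.normalize(self.vis + self.ir, dim=1)      # (num_ids, dim)
        logits = F.normalize(feats, dim=1) @ protos.t() / temp
        return F.cross_entropy(logits, labels)
```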
Related papers
- DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification [30.593882551803855]
Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. To address these challenges, we propose a Gait Representation Learning framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2.
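As a rough illustration of pairing DINOv2 priors with silhouettes, the sketch below pools frozen DINOv2 patch tokens under a silhouette mask to form a gait embedding. The fusion scheme and module names are assumptions made for illustration, not the paper's design; only the `torch.hub` entry point for DINOv2 is a real API.

```python
import torch
import torch.nn as nn

class SilhouetteGaitSketch(nn.Module):
    def __init__(self, feat_dim=384):  # 384 = ViT-S/14 embedding width
        super().__init__()
        # Real DINOv2 ViT-S/14 backbone from torch.hub, kept frozen as a prior.
        self.dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        for p in self.dino.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(feat_dim, feat_dim)  # assumed gait head

    def forward(self, frames, silhouettes):
        # frames: (B, 3, H, W) with H, W divisible by 14;
        # silhouettes: (B, 1, H//14, W//14) binary masks at patch resolution.
        tokens = self.dino.forward_features(frames)['x_norm_patchtokens']  # (B, N, 384)
        mask = silhouettes.flatten(1).unsqueeze(-1)                        # (B, N, 1)
        # Masked average pooling: keep only tokens inside the silhouette.
        gait = (tokens * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.proj(gait)  # (B, 384) gait embedding
```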
arXiv Detail & Related papers (2025-11-06T11:21:13Z) - Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification [81.3063589622217]
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets.
arXiv Detail & Related papers (2025-09-15T05:10:43Z) - AG-VPReID.VIR: Bridging Aerial and Ground Platforms for Video-based Visible-Infrared Person Re-ID [36.00219379027019]
We present AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset. This dataset captures 1,837 identities across 4,861 tracklets (124,855 frames) using both UAV-mounted and fixed CCTV cameras in RGB and infrared modalities. Our approach bridges the domain gaps between aerial-ground perspectives and RGB-IR modalities through style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling.
arXiv Detail & Related papers (2025-07-24T00:13:25Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: a Multi-granularity Contrastive cross-modal collaborative Generation model.
arXiv Detail & Related papers (2024-10-12T06:21:58Z) - Modality Unifying Network for Visible-Infrared Person Re-Identification [24.186989535051623]
Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations.
Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space.
We propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID.
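For context, here is a minimal sketch of the modality-shared embedding idea that the abstract contrasts against: two modality-specific stems feed one shared trunk, so visible and infrared inputs land in a single feature space. This is a generic illustration of the baseline paradigm, not MUN's auxiliary-modality design, and all layer choices are assumptions.

```python
import torch.nn as nn

class SharedEmbeddingBaseline(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.stem_vis = nn.Conv2d(3, 64, 7, stride=2, padding=3)  # RGB stem
        self.stem_ir = nn.Conv2d(3, 64, 7, stride=2, padding=3)   # IR stem
        self.shared = nn.Sequential(                              # shared trunk
            nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x, modality):
        # Route through the modality-specific stem, then the shared trunk,
        # so both modalities are embedded in the same feature space.
        stem = self.stem_vis if modality == 'vis' else self.stem_ir
        return self.shared(stem(x))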
arXiv Detail & Related papers (2023-09-12T14:22:22Z) - Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
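A hedged sketch of the general long/short-term split the title suggests: short-term cues from adjacent-frame differences and long-term cues from sequence-wide pooling, concatenated into one clip feature. The module names and aggregation scheme are assumptions for illustration, not LSTRL's actual extractors.

```python
import torch
import torch.nn as nn

class LongShortTermSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.short = nn.Linear(dim, dim)  # aggregates adjacent-frame motion cues
        self.long = nn.Linear(dim, dim)   # aggregates sequence-wide appearance

    def forward(self, x):
        # x: (B, T, dim) per-frame features from any backbone.
        diff = x[:, 1:] - x[:, :-1]              # short-term: frame-to-frame change
        short = self.short(diff.mean(dim=1))     # pooled motion representation
        long = self.long(x.mean(dim=1))          # long-term: whole-sequence pooling
        return torch.cat([short, long], dim=-1)  # (B, 2*dim) clip-level feature
```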
arXiv Detail & Related papers (2023-08-07T16:22:47Z) - A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition [24.02488085447691]
First, we introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp, providing additional temporal regularization for motion recognition.
Second, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning.
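A hypothetical reading of ShuffleMix as described above, for illustration: each clip is blended, MixUp-style, with a temporally shuffled partner clip, adding temporal regularization on top of the usual MixUp label mixing. The exact formulation in the paper may differ.

```python
import torch

def shufflemix(clips, labels, alpha=0.5):
    """clips: (B, T, C, H, W); labels: (B,). Returns mixed clips and label pairs."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))   # pair each clip with another clip
    t_perm = torch.randperm(clips.size(1)) # shuffle the partner's frame order
    shuffled = clips[perm][:, t_perm]      # partner clip, shuffled in time
    mixed = lam * clips + (1 - lam) * shuffled  # standard MixUp blend
    # Train with the lam-weighted loss, as in MixUp:
    #   loss = lam * ce(pred, labels) + (1 - lam) * ce(pred, labels[perm])
    return mixed, labels, labels[perm], lam
```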
arXiv Detail & Related papers (2022-11-16T19:00:23Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
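To make the "full-duplex" idea concrete, here is a minimal sketch in which appearance and motion branches transmit and receive gating messages simultaneously before fusion. The 1x1-conv gating form is an assumption for illustration; FSNet's actual relational modules differ in detail.

```python
import torch
import torch.nn as nn

class FullDuplexExchange(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.to_motion = nn.Conv2d(dim, dim, 1)  # appearance -> motion message
        self.to_appear = nn.Conv2d(dim, dim, 1)  # motion -> appearance message

    def forward(self, appear, motion):
        # Both messages are computed from the *current* features, so the two
        # branches transmit and receive at the same time (full duplex),
        # rather than passing features in one direction first.
        msg_to_motion = torch.sigmoid(self.to_motion(appear))
        msg_to_appear = torch.sigmoid(self.to_appear(motion))
        appear_out = appear * msg_to_appear + appear  # receive motion cues
        motion_out = motion * msg_to_motion + motion  # receive appearance cues
        return appear_out, motion_out
```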
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.