Related papers: Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification

Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification

URL: http://arxiv.org/abs/2007.09314v1
Date: Sat, 18 Jul 2020 03:08:13 GMT
Title: Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification
Authors: Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, Jiebo Luo
Abstract summary: Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images. We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
Score: 208.1227090864602
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. Due to the large intra-class variations and cross-modality discrepancy with large amount of sample noise, it is difficult to learn discriminative part features. Existing VI-ReID methods instead tend to learn global representations, which have limited discriminability and weak robustness to noisy images. In this paper, we propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID. We propose an intra-modality weighted-part attention module to extract discriminative part-aggregated features, by imposing the domain knowledge on the part relationship mining. To enhance robustness against noisy samples, we introduce cross-modality graph structured attention to reinforce the representation with the contextual relations across the two modalities. We also develop a parameter-free dynamic dual aggregation learning strategy to adaptively integrate the two components in a progressive joint training manner. Extensive experiments demonstrate that DDAG outperforms the state-of-the-art methods under various settings.

Related papers

DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities [28.992992584085787]
DIS2 is a new paradigm shifting from modality-shared feature dependence to active, guided missing features compensation.<n> Compensatory features are explicitly captured which, when fused with the features of the available modality, approximate the ideal fused representation of the full-modality case.<n>Our proposed approach significantly outperforms state-of-the-art methods across benchmarks.
arXiv Detail & Related papers (2026-01-20T01:33:54Z)
Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification [59.59359638389348]
We propose a Dual-level Modality Debiasing Learning framework that implements debiasing at both the model and optimization levels.<n>Experiments on benchmark datasets demonstrate that DMDL could enable modality-invariant feature learning and a more generalized model.
arXiv Detail & Related papers (2025-12-03T12:43:16Z)
X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification [79.37768038337971]
We propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID.<n> Specifically, we first propose a Cross-modality Prototype Collaboration (CPC)<n>Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment.
arXiv Detail & Related papers (2025-11-22T07:57:15Z)
DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification [30.593882551803855]
Video-based Visible-Infrared person re-identification (VVI-ID) aims to retrieve the same pedestrian across visible and infrared modalities from modality.<n>To address these challenges, we propose a Gait Representation Learning framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues.<n>Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (GL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2.
arXiv Detail & Related papers (2025-11-06T11:21:13Z)
Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach [3.0868241505670198]
We propose a novel framework called the Dynamic Contrastive Dual-Path Network (D-HAR)<n>The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseCDPNet branches collaboratively process multimodal sensor data.<n>Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction.<n>Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation.
arXiv Detail & Related papers (2025-07-03T17:37:46Z)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations [68.9300049150948]
We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID)<n>Existing data collection approaches yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions.<n>We present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood.
arXiv Detail & Related papers (2025-05-04T13:00:29Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
Frequency Domain Modality-invariant Feature Learning for Visible-infrared Person Re-Identification [79.9402521412239]
We propose a novel Frequency Domain modality-invariant feature learning framework (FDMNet) to reduce modality discrepancy from the frequency domain perspective. Our framework introduces two novel modules, namely the Instance-Adaptive Amplitude Filter (IAF) and the Phrase-Preserving Normalization (PPNorm)
arXiv Detail & Related papers (2024-01-03T17:11:27Z)
Modality Unifying Network for Visible-Infrared Person Re-Identification [24.186989535051623]
Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations. Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space. We propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID.
arXiv Detail & Related papers (2023-09-12T14:22:22Z)
Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification [90.39454748065558]
Body shape is one of the significant modality-shared cues for VI-ReID. We propose shape-erased feature learning paradigm that decorrelates modality-shared features in two subspaces. Experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-04-09T10:22:10Z)
Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification [27.75907274034702]
We propose a novel deep learning framework named Progressive Modality-shared Transformer (PMT) for effective VI-ReID. To reduce the negative effect of modality gaps, we first take the gray-scale images as an auxiliary modality and propose a progressive learning strategy. To cope with the problem of large intra-class differences and small inter-class differences, we propose a Discriminative Center Loss.
arXiv Detail & Related papers (2022-12-01T02:20:16Z)
Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination. In this paper, we enrich intra-modality and cross-modality relations for representation modeling. We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
On Exploring Pose Estimation as an Auxiliary Learning Task for Visible-Infrared Person Re-identification [66.58450185833479]
In this paper, we exploit Pose Estimation as an auxiliary learning task to assist the VI-ReID task in an end-to-end framework. By jointly training these two tasks in a mutually beneficial manner, our model learns higher quality modality-shared and ID-related features. Experimental results on two benchmark VI-ReID datasets show that the proposed method consistently improves state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-01-11T09:44:00Z)
Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH) We learn informative representations that can preserve both intra- and inter-modal similarities. The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification [35.55895776505113]
Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global. Cross-modality correlations can thus be efficiently explored on salient features for distinctive modality-invariant feature learning.
arXiv Detail & Related papers (2020-12-12T15:39:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.