Adversarial Self-Attack Defense and Spatial-Temporal Relation Mining for
Visible-Infrared Video Person Re-Identification
- URL: http://arxiv.org/abs/2307.03903v3
- Date: Fri, 11 Aug 2023 09:15:27 GMT
- Title: Adversarial Self-Attack Defense and Spatial-Temporal Relation Mining for
Visible-Infrared Video Person Re-Identification
- Authors: Huafeng Li, Le Xu, Yafei Zhang, Dapeng Tao, Zhengtao Yu
- Abstract summary: The paper proposes a new visible-infrared video person re-ID method from a novel perspective, i.e., adversarial self-attack defense and spatial-temporal relation mining.
The proposed method exhibits compelling performance on large-scale cross-modality video datasets.
- Score: 24.9205771457704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In visible-infrared video person re-identification (re-ID), the keys to
cross-modal pedestrian identity matching are extracting features that are unaffected
by changes in complex scenes (such as modality, camera view, pedestrian pose, and
background) and mining and exploiting motion information. To this end, this paper
proposes a new visible-infrared video person re-ID method from a novel perspective,
i.e., adversarial self-attack defense and spatial-temporal relation mining. In this
work, changes of view, pose, and background, together with the modality discrepancy,
are regarded as the main factors that perturb person identity features. The
interference information already contained in the training samples is used as an
adversarial perturbation that attacks the re-ID model during training, making the
model more robust to these unfavorable factors. Because the attack is introduced by
activating the interference information in the input samples rather than by
generating adversarial examples, it is called adversarial self-attack. This design
integrates adversarial attack and defense into a single framework.
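To make the self-attack idea concrete, here is a minimal, hypothetical sketch of one training step; the module layout, feature shapes, and losses are illustrative assumptions, not the paper's actual design. The perturbation is taken from differences already present in the data (e.g., the same identity seen under two conditions), so no adversarial images are synthesized.

```python
# Hypothetical sketch of an adversarial self-attack training step.
# `model` returns a (B, D) identity feature; `model.classifier` is an
# assumed identity classification head.
import torch.nn.functional as F

def self_attack_step(model, clips_a, clips_b, labels, optimizer):
    """clips_a, clips_b: two clips of the same identities captured under
    different conditions (e.g., visible vs. infrared), (B, T, C, H, W)."""
    feat_a = model(clips_a)
    feat_b = model(clips_b)

    # Interference already contained in the samples: the cross-condition
    # feature difference. Adding it back "attacks" the identity feature
    # without generating any adversarial image.
    perturbation = (feat_b - feat_a).detach()
    feat_attacked = feat_a + 0.5 * perturbation  # partial shift toward the other condition

    # Defense: the same identity must be predicted from clean and
    # self-attacked features alike.
    loss = F.cross_entropy(model.classifier(feat_a), labels) \
         + F.cross_entropy(model.classifier(feat_attacked), labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```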
The paper further proposes a spatial-temporal information-guided feature
representation network to exploit the information in video sequences. The network
not only extracts the information contained in the frame sequence but also uses the
spatial relations among local features to guide the extraction of more robust features.
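As a rough illustration of spatial-temporal relation mining, the sketch below lets local part features attend to one another within each frame and then aggregates frames over time; the attention/GRU composition and all dimensions are assumptions for illustration, not the paper's network.

```python
# Illustrative spatial-temporal relation module (not the paper's architecture).
import torch.nn as nn

class SpatialTemporalRelation(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.relation = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.GRU(dim, dim, batch_first=True)

    def forward(self, part_feats):
        """part_feats: (B, T, P, D) local features for P body parts per frame."""
        B, T, P, D = part_feats.shape
        x = part_feats.reshape(B * T, P, D)
        # Spatial relations: each part attends to every other part in its frame.
        x, _ = self.relation(x, x, x)
        frame_feats = x.mean(dim=1).reshape(B, T, D)
        # Temporal aggregation over the frame sequence.
        _, h = self.temporal(frame_feats)
        return h[-1]  # (B, D) clip-level feature
```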
The proposed method exhibits compelling performance on large-scale cross-modality
video datasets. The source code will be released at https://github.com/lhf12278/xxx.
Related papers
- Generative Adversarial Patches for Physical Attacks on Cross-Modal Pedestrian Re-Identification [24.962600785183582]
Visible-infrared pedestrian Re-identification (VI-ReID) aims to match pedestrian images captured by infrared cameras and visible cameras.
This paper introduces the first physical adversarial attack against VI-ReID models.
arXiv Detail & Related papers (2024-10-26T06:40:10Z)
- Erasing, Transforming, and Noising Defense Network for Occluded Person Re-Identification [36.91680117072686]
We propose the Erasing, Transforming, and Noising Defense Network (ETNDNet) to solve occluded person re-ID.
In the proposed ETNDNet, we firstly erase the feature map at random to create an adversarial representation with incomplete information.
Secondly, we randomly transform the feature map to simulate the spatial misalignment caused by occlusion; thirdly, we perturb the feature map with random values to address the noisy information introduced by obstacles and non-target pedestrians.
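A rough sketch of these three random feature-map operations (region sizes, shift ranges, and noise scale are illustrative assumptions):

```python
# Illustrative erasing / transforming / noising of a (B, C, H, W) feature map.
import torch

def erase(fmap, max_frac=0.3):
    """Zero out a random rectangular region (incomplete information)."""
    B, C, H, W = fmap.shape
    h = max(1, int(H * max_frac * torch.rand(1).item()))
    w = max(1, int(W * max_frac * torch.rand(1).item()))
    y = torch.randint(0, H - h + 1, (1,)).item()
    x = torch.randint(0, W - w + 1, (1,)).item()
    out = fmap.clone()
    out[:, :, y:y + h, x:x + w] = 0.0
    return out

def transform(fmap, max_shift=2):
    """Randomly shift the map to mimic spatial misalignment."""
    dy, dx = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
    return torch.roll(fmap, shifts=(dy, dx), dims=(2, 3))

def noise(fmap, sigma=0.1):
    """Add random values to emulate clutter from obstacles."""
    return fmap + sigma * torch.randn_like(fmap)
```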
arXiv Detail & Related papers (2023-07-14T06:42:21Z)
- Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and identical backgrounds.
We propose a Disentanglement and Switching and Aggregation Network (DSANet), which separates identity-representing features from camera-characteristic features and pays more attention to the ID information.
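Under assumptions, the disentanglement idea can be sketched as a shared feature fed to two heads, one supervised by identity labels and one by camera labels, so the two factors are pushed into separate subspaces (a generic sketch, not DSANet's actual architecture):

```python
# Generic identity/camera feature disentanglement sketch.
import torch.nn as nn

class DisentangleHeads(nn.Module):
    def __init__(self, in_dim, num_ids, num_cams):
        super().__init__()
        self.id_head = nn.Linear(in_dim, in_dim)    # identity-related subspace
        self.cam_head = nn.Linear(in_dim, in_dim)   # camera-characteristic subspace
        self.id_cls = nn.Linear(in_dim, num_ids)
        self.cam_cls = nn.Linear(in_dim, num_cams)

    def forward(self, feat):
        id_feat = self.id_head(feat)
        cam_feat = self.cam_head(feat)
        # Matching should use id_feat only; cam_feat absorbs camera cues.
        return self.id_cls(id_feat), self.cam_cls(cam_feat), id_feat
```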
arXiv Detail & Related papers (2022-12-16T04:27:56Z)
- Keypoint Message Passing for Video-based Person Re-Identification [106.41022426556776]
Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras.
Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement.
In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph.
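The graph construction can be sketched as follows; the edge set (skeleton links within a frame, same-joint links across consecutive frames) and the message passing that would follow are assumptions for illustration:

```python
# Illustrative spatial-temporal keypoint graph construction.
import torch

def build_st_graph(kp_feats, skeleton_edges):
    """kp_feats: (T, K, D) features sampled at K joint keypoints per frame.
    Returns flattened node features (T*K, D) and a (2, E) edge index."""
    T, K, D = kp_feats.shape
    edges = []
    for t in range(T):
        base = t * K
        # Spatial edges: joints connected along the skeleton within a frame.
        edges += [(base + i, base + j) for i, j in skeleton_edges]
        # Temporal edges: the same joint in consecutive frames.
        if t + 1 < T:
            edges += [(base + k, base + K + k) for k in range(K)]
    return kp_feats.reshape(T * K, D), torch.tensor(edges).t()
```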
arXiv Detail & Related papers (2021-11-16T08:01:16Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
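A co-attention style fusion of two feature streams can be sketched as below; the paper's exact formulation is not reproduced, and the residual fusion is an assumption:

```python
# Illustrative co-attention fusion of low- and high-level feature maps.
import torch
import torch.nn.functional as F

def co_attention(low, high):
    """low, high: (B, D, N) flattened feature maps with a shared channel dim.
    Each stream is reweighted by its affinity with the other stream."""
    affinity = torch.bmm(low.transpose(1, 2), high)      # (B, N, N)
    att_low = F.softmax(affinity, dim=1)                 # normalize over low positions
    att_high = F.softmax(affinity, dim=2)                # normalize over high positions
    low_out = torch.bmm(high, att_low.transpose(1, 2))   # high-level cues mapped to low
    high_out = torch.bmm(low, att_high)                  # low-level cues mapped to high
    return low + low_out, high + high_out                # residual fusion
```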
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
The proposed CTL utilizes a CNN backbone and a keypoint estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- A Flow-Guided Mutual Attention Network for Video-Based Person Re-Identification [25.217641512619178]
Person ReID is a challenging problem in many analytics and surveillance applications.
Video-based person ReID has recently gained much interest because it allows capturing discriminant spatio-temporal information.
In this paper, the motion pattern of a person is explored as an additional cue for ReID.
arXiv Detail & Related papers (2020-08-09T18:58:11Z)
- Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [51.110453988705395]
Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL).
To achieve a complete model of video-based person Re-ID, a multi-task framework with an Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
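The identity-hard mining at the core of such a triplet loss can be sketched as follows (the attribute-aware weighting is omitted; the margin and the batch-hard mining rule are standard assumptions):

```python
# Batch-hard triplet loss sketch: hardest positive / hardest negative per anchor.
import torch
import torch.nn.functional as F

def identity_hard_triplet(feats, labels, margin=0.3):
    """feats: (B, D) clip features; labels: (B,) identity labels."""
    dist = torch.cdist(feats, feats)                            # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist - (~same).float() * 1e9).max(1).values  # farthest same-ID sample
    hardest_neg = (dist + same.float() * 1e9).min(1).values     # closest different-ID sample
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```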
arXiv Detail & Related papers (2020-06-13T09:15:38Z)
- Over-the-Air Adversarial Flickering Attacks against Video Recognition Networks [54.82488484053263]
Deep neural networks for video classification may be subjected to adversarial manipulation.
We present a manipulation scheme for fooling video classifiers by introducing a flickering temporal perturbation.
The attack was implemented on several target models, and its transferability was demonstrated.
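The flickering idea reduces to one uniform RGB offset per frame; a crude, non-optimized sketch (the real attack optimizes the offsets against the target classifier) might look like:

```python
# Illustrative flickering perturbation: one RGB offset per frame,
# constant across the whole image.
import torch

def apply_flicker(video, delta):
    """video: (T, C, H, W) clip in [0, 1]; delta: (T, C) per-frame offsets."""
    return (video + delta[:, :, None, None]).clamp(0.0, 1.0)

clip = torch.rand(16, 3, 112, 112)   # dummy 16-frame clip
delta = 0.05 * torch.randn(16, 3)    # small random flicker pattern
adv_clip = apply_flicker(clip, delta)
```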
arXiv Detail & Related papers (2020-02-12T17:58:12Z)