Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos
- URL: http://arxiv.org/abs/2004.04979v2
- Date: Mon, 11 May 2020 10:04:19 GMT
- Title: Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos
- Authors: Jiawei Liu, Zheng-Jun Zha, Xierong Zhu, Na Jiang
- Abstract summary: We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatio-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features and spatial-temporal information correlation.
- Score: 85.6430597108455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Person re-identification aims at identifying a certain pedestrian across
non-overlapping camera networks. Video-based re-identification approaches have
gained significant attention recently, expanding image-based approaches by
learning features from multiple frames. In this work, we propose a novel
Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person
re-identification in videos. It captures the common salient foreground regions
among video frames and explores the spatial-temporal long-range context
interdependency from such regions, towards learning discriminative pedestrian
representation. Specifically, multiple co-saliency learning modules within
CSTNet are designed to utilize the correlated information across video frames
to extract the salient features from the task-relevant regions and suppress
background interference. Moreover, multiple spatio-temporal interaction modules
within CSTNet are proposed, which exploit the spatial and temporal long-range
context interdependencies on such features and spatial-temporal information
correlation, to enhance feature representation. Extensive experiments on two
benchmarks have demonstrated the effectiveness of the proposed method.
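The abstract describes two ingredients: co-saliency learning that uses correlated information across frames to keep task-relevant regions, and spatio-temporal interaction that models long-range context dependencies. The following is a minimal NumPy sketch of those two ideas (cross-frame saliency as correlation with the temporal mean feature, long-range interaction as non-local self-attention over all space-time positions). The function names and the specific formulations are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_saliency_weight(feats):
    """feats: (T, C, H, W) per-frame features.
    Cross-frame saliency: correlate each frame with the temporal
    mean feature and squash to a (T, 1, H, W) soft mask, so
    regions shared across frames score high and background low."""
    mean_feat = feats.mean(axis=0, keepdims=True)          # (1, C, H, W)
    corr = (feats * mean_feat).sum(axis=1, keepdims=True)  # (T, 1, H, W)
    return 1.0 / (1.0 + np.exp(-corr))                     # sigmoid mask

def spatio_temporal_interaction(feats):
    """Non-local (self-attention) interaction across all
    space-time positions, with a residual connection.
    feats: (T, C, H, W)."""
    T, C, H, W = feats.shape
    x = feats.transpose(0, 2, 3, 1).reshape(T * H * W, C)  # (N, C)
    attn = softmax(x @ x.T / np.sqrt(C), axis=-1)          # (N, N)
    out = attn @ x                                         # (N, C)
    return (x + out).reshape(T, H, W, C).transpose(0, 3, 1, 2)

# Toy run: 4 frames, 8 channels, 6x3 spatial grid.
rng = np.random.default_rng(0)
f = rng.standard_normal((4, 8, 6, 3))
salient = f * co_saliency_weight(f)          # suppress background
enhanced = spatio_temporal_interaction(salient)
print(enhanced.shape)                         # → (4, 8, 6, 3)
```

In the paper these operations are learned modules inside a CNN; here the saliency mask and attention are parameter-free stand-ins that only convey the data flow: shared foreground first, long-range interaction second.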
Related papers
- Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Multiple Object Tracking with Correlation Learning [16.959379957515974]
We propose to exploit the local correlation module to model the topological relationship between targets and their surrounding environment.
Specifically, we establish dense correspondences of each spatial location and its context, and explicitly constrain the correlation volumes through self-supervised learning.
Our approach demonstrates the effectiveness of correlation learning with the superior performance and obtains state-of-the-art MOTA of 76.5% and IDF1 of 73.6% on MOT17.
arXiv Detail & Related papers (2021-04-08T06:48:02Z) - Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
Our method consistently and significantly outperforms all the state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z) - Temporal Attribute-Appearance Learning Network for Video-based Person
Re-Identification [94.03477970865772]
We propose a novel Temporal Attribute-Appearance Learning Network (TALNet) for video-based person re-identification.
TALNet exploits human attributes and appearance to learn comprehensive and effective pedestrian representations from videos.
arXiv Detail & Related papers (2020-09-09T09:28:07Z) - Multi-Granularity Reference-Aided Attentive Feature Aggregation for
Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-Aided Attentive Feature Aggregation module (MG-RAFA).
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.