Temporal Attribute-Appearance Learning Network for Video-based Person
Re-Identification
- URL: http://arxiv.org/abs/2009.04181v1
- Date: Wed, 9 Sep 2020 09:28:07 GMT
- Title: Temporal Attribute-Appearance Learning Network for Video-based Person
Re-Identification
- Authors: Jiawei Liu, Xierong Zhu, Zheng-Jun Zha
- Abstract summary: We propose a novel Temporal Attribute-Appearance Learning Network (TALNet) for video-based person re-identification.
TALNet exploits human attributes and appearance to learn comprehensive and effective pedestrian representations from videos.
- Score: 94.03477970865772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-based person re-identification aims to match a specific pedestrian in
surveillance videos across different time and locations. Human attributes and
appearance are complementary to each other, both of them contribute to
pedestrian matching. In this work, we propose a novel Temporal
Attribute-Appearance Learning Network (TALNet) for video-based person
re-identification. TALNet simultaneously exploits human attributes and
appearance to learn comprehensive and effective pedestrian representations from
videos. It explores hard visual attention and temporal-semantic context for
attributes, and spatial-temporal dependencies among body parts for appearance, to
boost the learning of both. Specifically, an attribute branch network is
proposed with a spatial attention block and a temporal-semantic context block
for learning robust attribute representation. The spatial attention block
focuses the network on corresponding regions within video frames related to
each attribute, while the temporal-semantic context block learns both the temporal
context for each attribute across video frames and the semantic context among
attributes in each video frame. The appearance branch network is designed to
learn effective appearance representation from both whole body and body parts
with spatial-temporal dependencies among them. TALNet leverages the
complementarity between attribute and appearance representations, and jointly
optimizes them in a multi-task learning fashion. Moreover, we annotate ID-level
attributes for each pedestrian in the two commonly used video datasets.
Extensive experiments on these datasets have verified the superiority of
TALNet over state-of-the-art methods.
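The abstract describes a two-branch, multi-task design: an attribute branch with spatial attention and temporal aggregation, an appearance branch, and joint optimization of identity and attribute objectives. The PyTorch sketch below illustrates only that general structure; the backbone, block designs, layer sizes, and loss weights are illustrative assumptions and do not reproduce the actual TALNet implementation.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Per-frame spatial attention that reweights backbone feature maps."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                          # x: (B*T, C, H, W)
        return x * torch.sigmoid(self.conv(x))     # soft attention map per frame


class TwoBranchVideoReID(nn.Module):
    """Attribute branch + appearance branch sharing one frame-level backbone."""

    def __init__(self, channels=512, num_ids=625, num_attrs=30):
        super().__init__()
        # Stand-in single-layer "backbone"; a real model would use e.g. a ResNet.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        self.attr_attention = SpatialAttention(channels)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Simple temporal aggregation (a GRU) standing in for the
        # temporal-semantic context block of the attribute branch.
        self.temporal = nn.GRU(channels, channels, batch_first=True)
        self.attr_head = nn.Linear(channels, num_attrs)   # attribute recognition
        self.id_head = nn.Linear(channels, num_ids)       # identity classification

    def forward(self, clip):                       # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))  # (B*T, C, h, w)

        # Attribute branch: spatially attended, then aggregated over time.
        attr = self.pool(self.attr_attention(feats)).flatten(1).view(b, t, -1)
        attr, _ = self.temporal(attr)
        attr_logits = self.attr_head(attr.mean(dim=1))

        # Appearance branch: clip-level appearance representation.
        app = self.pool(feats).flatten(1).view(b, t, -1).mean(dim=1)
        id_logits = self.id_head(app)
        return id_logits, attr_logits


# Joint multi-task objective: identity loss + attribute loss (weight assumed).
model = TwoBranchVideoReID()
clip = torch.randn(2, 4, 3, 128, 64)               # 2 clips of 4 frames each
id_logits, attr_logits = model(clip)
id_loss = nn.CrossEntropyLoss()(id_logits, torch.tensor([0, 1]))
attr_loss = nn.BCEWithLogitsLoss()(attr_logits, torch.rand(2, 30))
loss = id_loss + 0.5 * attr_loss
```

In practice, video re-ID models of this kind usually pair the identity cross-entropy with a triplet or ranking loss over clip-level features; the single classification term above is a simplification.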
Related papers
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework can attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized spatio-temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better distinguish between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- SSAN: Separable Self-Attention Network for Video Representation Learning [11.542048296046524]
We propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially (a rough sketch of this factorization follows after this list).
By adding the SSA module into a 2D CNN, we build an SSA network (SSAN) for video representation learning.
Our approach outperforms state-of-the-art methods on Something-Something and Kinetics-400 datasets.
arXiv Detail & Related papers (2021-05-27T10:02:04Z)
- TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation [101.6042317204022]
Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
arXiv Detail & Related papers (2020-10-12T05:58:09Z)
- IAUnet: Global Context-Aware Feature Learning for Person Re-Identification [106.50534744965955]
The IAU block enables the features to incorporate global spatial, temporal, and channel context.
It is lightweight, end-to-end trainable, and can be easily plugged into existing CNNs to form IAUnet.
Experiments show that IAUnet performs favorably against state-of-the-art on both image and video reID tasks.
arXiv Detail & Related papers (2020-09-02T13:07:10Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features and spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
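The SSAN entry above describes attention that models spatial and temporal correlations sequentially; the sketch referenced there follows. It shows one generic way a factorized (separable) self-attention over frame tokens can be arranged. The token layout, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the SSAN module itself.

```python
import torch
import torch.nn as nn


class SeparableSelfAttention(nn.Module):
    """Spatial self-attention within each frame, then temporal self-attention
    across frames at each spatial position (a generic factorized scheme)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, T, N, C) frame tokens
        b, t, n, c = x.shape
        # Spatial step: tokens of the same frame attend to each other.
        s = x.reshape(b * t, n, c)
        s, _ = self.spatial_attn(s, s, s)
        s = s.reshape(b, t, n, c)
        # Temporal step: the same spatial position attends across frames.
        tmp = s.permute(0, 2, 1, 3).reshape(b * n, t, c)
        tmp, _ = self.temporal_attn(tmp, tmp, tmp)
        return tmp.reshape(b, n, t, c).permute(0, 2, 1, 3)


tokens = torch.randn(2, 8, 49, 256)                # 2 clips, 8 frames, 7x7 positions
out = SeparableSelfAttention()(tokens)             # same shape as the input
```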