Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification
- URL: http://arxiv.org/abs/2304.14122v1
- Date: Thu, 27 Apr 2023 12:16:44 GMT
- Title: Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification
- Authors: Xuehu Liu, Chenyang Yu, Pingping Zhang and Huchuan Lu
- Abstract summary: We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
- Score: 91.56939957189505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced deep Convolutional Neural Networks (CNNs) have shown great success
in video-based person Re-Identification (Re-ID). However, they usually focus on
the most obvious regions of persons and have limited global representation
ability. Recently, Transformers have been shown to explore inter-patch
relations with global observations, bringing performance improvements. In this work,
we take advantage of both sides and propose a novel spatial-temporal complementary
learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for
high-performance video-based person Re-ID. Firstly, we couple CNNs and
Transformers to extract two kinds of visual features and experimentally verify
their complementarity. Further, in the spatial dimension, we propose a Complementary
Content Attention (CCA) to take advantage of the coupled structure and guide
independent features for spatial complementary learning. In the temporal dimension, a
Hierarchical Temporal Aggregation (HTA) is proposed to progressively capture
inter-frame dependencies and encode temporal information. Besides, a gated
attention mechanism is utilized to deliver aggregated temporal information into the CNN
and Transformer branches for temporal complementary learning. Finally, we
introduce a self-distillation training strategy to transfer the superior
spatial-temporal knowledge to the backbone networks for higher accuracy and
greater efficiency. In this way, two kinds of typical features from the same videos
are integrated into more informative representations. Extensive
experiments on four public Re-ID benchmarks demonstrate that our framework
achieves better performance than most state-of-the-art methods.
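As a rough illustration of the dual-branch idea described in the abstract (not the authors' released implementation), the PyTorch sketch below couples a small CNN branch and a ViT-style Transformer branch on each frame, aggregates shared temporal context, and feeds it back to both branches through a gated fusion. All module names, shapes, and the identity count are assumptions made for illustration.

```python
# Minimal two-branch sketch (assumed shapes and module names; not the
# authors' implementation). A video clip is a tensor of shape
# (B, T, 3, H, W): batch, frames, RGB channels, height, width.
import torch
import torch.nn as nn


class GatedTemporalFusion(nn.Module):
    """Feed aggregated temporal context back into a branch via a learned gate."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, branch_feat, temporal_ctx):
        # branch_feat, temporal_ctx: (B, T, dim)
        g = self.gate(torch.cat([branch_feat, temporal_ctx], dim=-1))
        return g * temporal_ctx + (1.0 - g) * branch_feat


class DualBranchSketch(nn.Module):
    """Couple a CNN branch and a Transformer branch per frame, aggregate over
    time, and deliver the aggregate back to both branches (gated fusion)."""

    def __init__(self, dim=256, num_ids=751):  # num_ids is an arbitrary example
        super().__init__()
        # CNN branch: local detail per frame (toy stand-in for a ResNet backbone).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Transformer branch: ViT-style patch tokens per frame (global relations).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Temporal aggregation shared by both branches (very rough stand-in
        # for the paper's Hierarchical Temporal Aggregation).
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.fuse_cnn = GatedTemporalFusion(dim)
        self.fuse_trf = GatedTemporalFusion(dim)
        self.classifier = nn.Linear(2 * dim, num_ids)

    def forward(self, clip):                                   # (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                            # (B*T, 3, H, W)
        cnn_feat = self.cnn(frames).flatten(1).view(b, t, -1)  # (B, T, dim)
        tokens = self.patch_embed(frames).flatten(2).transpose(1, 2)
        trf_feat = self.spatial(tokens).mean(dim=1).view(b, t, -1)
        ctx = self.temporal(0.5 * (cnn_feat + trf_feat))       # (B, T, dim)
        cnn_out = self.fuse_cnn(cnn_feat, ctx).mean(dim=1)     # (B, dim)
        trf_out = self.fuse_trf(trf_feat, ctx).mean(dim=1)     # (B, dim)
        video_feat = torch.cat([cnn_out, trf_out], dim=-1)     # (B, 2*dim)
        return video_feat, self.classifier(video_feat)


if __name__ == "__main__":
    model = DualBranchSketch()
    clip = torch.randn(2, 8, 3, 256, 128)      # two clips of eight 256x128 frames
    feat, logits = model(clip)
    print(feat.shape, logits.shape)            # (2, 512) and (2, 751)
```

In the paper's full pipeline, the branch features additionally interact spatially (CCA), are aggregated hierarchically over time (HTA), and supervise the backbones via self-distillation; the sketch only conveys the coupled two-branch layout with gated temporal feedback.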
Related papers
- Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph
Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
arXiv Detail & Related papers (2023-09-23T02:40:28Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations with LTN consistently improves the performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z)
- Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition [11.116921653535226]
We investigate two frameworks that combine CNN vision backbone and Transformer to enhance fine-grained action recognition.
Our experimental results show that both our Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality association.
We achieve new state-of-the-art performance on the FineGym benchmark dataset for both proposed architectures.
arXiv Detail & Related papers (2022-08-03T08:01:55Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
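As a loose, hypothetical sketch of the decoupled spatial-then-temporal attention mentioned in the GTA entry above (module names and shapes are assumptions, not the paper's implementation), per-frame spatial self-attention followed by per-position global temporal attention might look like:

```python
# Hypothetical decoupled space-time attention over features of shape
# (B, T, N, D): batch, frames, spatial positions, channels.
import torch
import torch.nn as nn


class DecoupledSpaceTimeAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                    # x: (B, T, N, D)
        b, t, n, d = x.shape
        # Spatial self-attention within each frame (over the N positions).
        s = x.reshape(b * t, n, d)
        s, _ = self.spatial(s, s, s)
        s = s.reshape(b, t, n, d)
        # Global temporal attention at each spatial position (over the T frames).
        g = s.permute(0, 2, 1, 3).reshape(b * n, t, d)
        g, _ = self.temporal(g, g, g)
        return g.reshape(b, n, t, d).permute(0, 2, 1, 3)     # back to (B, T, N, D)


if __name__ == "__main__":
    x = torch.randn(2, 8, 49, 256)          # e.g. eight frames of 7x7 feature maps
    y = DecoupledSpaceTimeAttention()(x)
    print(y.shape)                          # torch.Size([2, 8, 49, 256])
```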