Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person
Re-identification
- URL: http://arxiv.org/abs/2301.00531v1
- Date: Mon, 2 Jan 2023 05:17:31 GMT
- Title: Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person
Re-identification
- Authors: Ziyi Tang, Ruimao Zhang, Zhanglin Peng, Jinrui Chen, Liang Lin
- Abstract summary: We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
- Score: 78.08536797239893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the Transformer architecture has shown its superiority in
the video-based person re-identification task. Inspired by video representation
learning, these methods mainly focus on designing modules to extract
informative spatial and temporal features. However, they are still limited in
extracting local attributes and global identity information, which are critical
for the person re-identification task. In this paper, we propose a novel
Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly
designed proxy embedding modules to address the above issue. Specifically,
MSTAT consists of three stages to encode the attribute-associated, the
identity-associated, and the attribute-identity-associated information from the
video clips, respectively, achieving the holistic perception of the input
person. We combine the outputs of all the stages for the final identification.
In practice, to save the computational cost, the Spatial-Temporal Aggregation
(STA) modules are first adopted in each stage to conduct the self-attention
operations along the spatial and temporal dimensions separately. We further
introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP
and IAP) to extract the informative and discriminative feature representations
at different stages. All of them are realized by employing newly designed
self-attention operations with specific meanings. Moreover, temporal patch
shuffling is also introduced to further improve the robustness of the model.
Extensive experimental results demonstrate the effectiveness of the proposed
modules in extracting the informative and discriminative information from the
videos, and show that MSTAT achieves state-of-the-art accuracy on
various standard benchmarks.
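The abstract describes two concrete mechanisms: STA modules that apply self-attention along the spatial and temporal dimensions separately (to reduce cost versus full spatio-temporal attention), and a temporal patch shuffling augmentation. The PyTorch snippet below is a minimal sketch of these two ideas only; tensor shapes, module names, and hyper-parameters are illustrative assumptions, it is not the authors' implementation, and the AAP/IAP proxy embedding modules are omitted.

```python
# Sketch of factorized spatio-temporal self-attention and temporal patch
# shuffling, assuming ViT-style patch tokens of shape (batch, frames, patches, dim).
import torch
import torch.nn as nn


class FactorizedSTAttention(nn.Module):
    """Self-attention applied separately along the spatial and temporal axes,
    in the spirit of the STA module described in the abstract."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        # Spatial attention: patches within the same frame attend to each other.
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, t, n, d)
        # Temporal attention: the same patch position attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)


def temporal_patch_shuffle(x: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Training-time augmentation: permute the frame order of a small fraction
    of patch positions, so the model cannot rely on exact temporal alignment.
    This is an assumed reading of "temporal patch shuffling"."""
    b, t, n, d = x.shape
    x = x.clone()
    num_shuffled = max(1, int(n * ratio))
    for i in torch.randperm(n)[:num_shuffled].tolist():
        x[:, :, i] = x[:, torch.randperm(t), i]
    return x


if __name__ == "__main__":
    clip = torch.randn(2, 8, 196, 768)    # (batch, frames, patches, dim)
    clip = temporal_patch_shuffle(clip)   # applied only during training
    out = FactorizedSTAttention()(clip)
    print(out.shape)                      # torch.Size([2, 8, 196, 768])
```

Factorizing attention this way reduces the per-layer cost from attending over all T*N tokens jointly to two cheaper passes over N and T tokens, which is the stated motivation for the STA design.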
Related papers
- Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning [29.19130646630545]
We introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network in our work.
MIP is designed to effectively utilize both invariant and specific information for identification.
Our proposed MIP performs better than most state-of-the-art methods.
arXiv Detail & Related papers (2024-06-18T06:39:03Z) - Dynamic Patch-aware Enrichment Transformer for Occluded Person
Re-Identification [14.219232629274186]
We present an end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer).
This model effectively distinguishes human body information from occlusions automatically and dynamically.
To ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy.
arXiv Detail & Related papers (2024-02-16T03:53:30Z) - Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based
Person Re-Identification [18.01407937934588]
We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
arXiv Detail & Related papers (2023-12-28T03:00:19Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Feature Disentanglement Learning with Switching and Aggregation for
Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and same backgrounds.
We propose a Disentanglement and Switching and Aggregation Network (DSANet), which segregates the features representing identity and features based on camera characteristics, and pays more attention to ID information.
arXiv Detail & Related papers (2022-12-16T04:27:56Z) - Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible part.
We propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge.
Under this condition, the occluded representation could be well aligned in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z) - Identity-aware Graph Memory Network for Action Detection [37.65846189707054]
We explicitly highlight the identity information of the actors in terms of both long-term and short-term context through a graph memory network.
Specifically, we propose the hierarchical graph neural network (IGNN) to comprehensively conduct long-term relation modeling.
We develop a dual attention module (DAM) to generate identity-aware constraint to reduce the influence of interference by the actors of different identities.
arXiv Detail & Related papers (2021-08-26T02:34:55Z) - Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task of Fill-in the Identity, that aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z) - Attribute-aware Identity-hard Triplet Loss for Video-based Person
Re-identification [51.110453988705395]
Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL).
To achieve a complete model of video-based person Re-ID, a multi-task framework with Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
arXiv Detail & Related papers (2020-06-13T09:15:38Z) - Multi-Granularity Reference-Aided Attentive Feature Aggregation for
Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-Attentive Feature aggregation module MG-RAFA.
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)