Feature Disentanglement Learning with Switching and Aggregation for
Video-based Person Re-Identification
- URL: http://arxiv.org/abs/2212.09498v1
- Date: Fri, 16 Dec 2022 04:27:56 GMT
- Title: Feature Disentanglement Learning with Switching and Aggregation for
Video-based Person Re-Identification
- Authors: Minjung Kim, MyeongAh Cho, Sangyoun Lee
- Abstract summary: In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and identical backgrounds.
We propose a Disentanglement and Switching and Aggregation Network (DSANet), which separates identity-representing features from camera-characteristic features and pays more attention to ID information.
- Score: 9.068045610800667
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames. Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and identical backgrounds. In this paper, we propose a Disentanglement and Switching and Aggregation Network (DSANet), which separates identity-representing features from camera-characteristic features and pays more attention to ID information. We also introduce an auxiliary task that uses a new pair of features, created through switching and aggregation, to increase the network's robustness across various camera scenarios. Furthermore, we devise a Target Localization Module (TLM) that extracts features robust to changes in the target's position across the frame flow, and a Frame Weight Generation (FWG) module that reflects temporal information in the final representation. Various loss functions for disentanglement learning are designed so that each component of the network can cooperate while satisfactorily performing its own role. Quantitative and qualitative results from extensive experiments demonstrate the superiority of DSANet over state-of-the-art methods on three benchmark datasets.
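The digest ships no code; as a rough PyTorch illustration of the disentangle-then-switch idea described above (the module names, the additive aggregation, and all dimensions are our assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class DisentangleHead(nn.Module):
    """Split a backbone feature into ID-related and camera-related parts.
    A minimal sketch: two parallel projections, not the paper's design."""
    def __init__(self, in_dim=2048, out_dim=1024):
        super().__init__()
        self.id_proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim))
        self.cam_proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim))

    def forward(self, x):                      # x: (B, in_dim)
        return self.id_proj(x), self.cam_proj(x)

def switch_and_aggregate(f_id, f_cam):
    """Auxiliary pair: keep each sample's ID feature but give it another
    sample's camera feature (here: a roll within the batch), then aggregate."""
    f_cam_switched = torch.roll(f_cam, shifts=1, dims=0)  # swap cameras across the batch
    return f_id + f_cam_switched                          # simple additive aggregation (assumed)

# toy usage
head = DisentangleHead()
feat = torch.randn(8, 2048)                 # 8 frame-level backbone features
f_id, f_cam = head(feat)
f_aux = switch_and_aggregate(f_id, f_cam)   # trained with the same ID objective as f_id
print(f_aux.shape)                          # torch.Size([8, 1024])
```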
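Similarly, a minimal stand-in for FWG-style temporal weighting: score each frame feature, normalize the scores over time, and take the weighted sum. The scoring network below is an assumption, not taken from the paper:

```python
import torch
import torch.nn as nn

class FrameWeightGeneration(nn.Module):
    """Score each frame, softmax over the time axis, fuse frames into
    one clip-level vector. A sketch of the idea only."""
    def __init__(self, dim=1024):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1))

    def forward(self, feats):                        # feats: (B, T, dim)
        w = torch.softmax(self.score(feats), dim=1)  # (B, T, 1) per-frame weights
        return (w * feats).sum(dim=1)                # (B, dim) temporal aggregation

fwg = FrameWeightGeneration()
clip = torch.randn(4, 8, 1024)   # 4 tracklets of 8 frames each
print(fwg(clip).shape)           # torch.Size([4, 1024])
```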
Related papers
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
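For readers unfamiliar with proxy embeddings, the generic sketch below shows the common pattern: learnable proxy tokens prepended to the clip tokens and refined by a transformer layer, so the proxies summarize one kind of information. It illustrates the idea only, not MSTAT's actual stages:

```python
import torch
import torch.nn as nn

class ProxyEmbeddingStage(nn.Module):
    """Learnable proxy tokens attend jointly with clip tokens; their pooled
    output acts as a summary embedding. A generic sketch, not MSTAT's module."""
    def __init__(self, dim=256, num_proxies=4, nhead=4):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(1, num_proxies, dim) * 0.02)
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        p = self.proxies.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([p, tokens], dim=1))
        return out[:, :p.size(1)].mean(dim=1)        # pooled proxy summary: (B, dim)

stage = ProxyEmbeddingStage()
print(stage(torch.randn(2, 64, 256)).shape)          # torch.Size([2, 256])
```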
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- Counting with Adaptive Auxiliary Learning [23.715818463425503]
This paper proposes an adaptive auxiliary task learning based approach for object counting problems.
We develop an attention-enhanced adaptively shared backbone network to enable learning of both task-shared and task-tailored features.
Our method achieves superior performance over state-of-the-art auxiliary-task-learning-based counting methods.
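A generic sketch of task-shared versus task-tailored features: one shared trunk plus a per-task channel-attention gate, so each head re-weights the shared features. This illustrates the pattern, not the paper's network; all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class SharedBackboneTwoTasks(nn.Module):
    """Shared trunk; SE-style gates tailor the shared features per task."""
    def __init__(self, ch=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        def gate():
            return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.gate_main, self.gate_aux = gate(), gate()
        self.head_main = nn.Conv2d(ch, 1, 1)   # e.g. density map for counting
        self.head_aux = nn.Conv2d(ch, 1, 1)    # e.g. an auxiliary prediction

    def forward(self, x):
        shared = self.trunk(x)
        return (self.head_main(shared * self.gate_main(shared)),
                self.head_aux(shared * self.gate_aux(shared)))

net = SharedBackboneTwoTasks()
d, s = net(torch.randn(1, 3, 64, 64))
print(d.shape, s.shape)   # torch.Size([1, 1, 64, 64]) twice
```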
arXiv Detail & Related papers (2022-03-08T13:10:17Z)
- CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification [38.96033760300123]
We propose a cross-modality transformer-based method (CMTR) for the visible-infrared person re-identification task.
We design novel modality embeddings, which are fused with token embeddings to encode modality information.
Our proposed CMTR significantly surpasses existing top-performing CNN-based methods.
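The modality-embedding idea is analogous to positional embeddings: add a learnable per-modality vector to every token so the transformer can tell visible from infrared inputs. A minimal sketch, not CMTR's exact design:

```python
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    """One learnable vector per modality, broadcast-added to all tokens."""
    def __init__(self, dim=256, num_modalities=2):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, dim)

    def forward(self, tokens, modality_id):          # tokens: (B, N, dim)
        B = tokens.size(0)
        m = self.embed(torch.full((B,), modality_id, dtype=torch.long))
        return tokens + m.unsqueeze(1)               # broadcast over the N tokens

me = ModalityEmbedding()
vis = me(torch.randn(2, 16, 256), modality_id=0)     # visible tokens
ir  = me(torch.randn(2, 16, 256), modality_id=1)     # infrared tokens
print(vis.shape, ir.shape)
```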
arXiv Detail & Related papers (2021-10-18T03:12:59Z)
- Spatio-Temporal Representation Factorization for Video-based Person Re-Identification [55.01276167336187]
We propose a Spatio-Temporal Representation Factorization (STRF) module for re-ID.
STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID.
We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
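A generic factorized spatio-temporal unit in the spirit of such plug-in blocks for 3D CNNs: a 1xkxk spatial convolution followed by a kx1x1 temporal convolution with a residual connection. STRF itself is more elaborate; this sketch only conveys the factorization:

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporal(nn.Module):
    """Spatial-then-temporal factorized 3D conv with a residual. A sketch,
    not the STRF module itself."""
    def __init__(self, ch=64, k=3):
        super().__init__()
        p = k // 2
        self.spatial = nn.Conv3d(ch, ch, (1, k, k), padding=(0, p, p))
        self.temporal = nn.Conv3d(ch, ch, (k, 1, 1), padding=(p, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                  # x: (B, C, T, H, W)
        return self.relu(x + self.temporal(self.relu(self.spatial(x))))

unit = FactorizedSpatioTemporal()
print(unit(torch.randn(1, 64, 8, 16, 8)).shape)   # torch.Size([1, 64, 8, 16, 8])
```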
arXiv Detail & Related papers (2021-07-25T19:29:37Z)
- AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID [20.700750237972155]
Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems.
The key challenge is to align inter-modality representations according to the semantic information present for a person while ignoring background information.
We present AXM-Net, a novel CNN-based architecture designed for learning semantically aligned visual and textual representations.
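The alignment objective can be illustrated with a standard InfoNCE loss over paired visual/textual features; AXM-Net's context-sharing attention is more involved, so treat this as a sketch of the goal, not the method:

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(vis_feat, txt_feat, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched visual/text pairs."""
    v = F.normalize(vis_feat, dim=1)
    t = F.normalize(txt_feat, dim=1)
    logits = v @ t.t() / temperature              # (B, B) similarity matrix
    target = torch.arange(v.size(0))              # matched pairs on the diagonal
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target)) / 2

loss = cross_modal_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```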
arXiv Detail & Related papers (2021-01-19T16:06:39Z)
- A Flow-Guided Mutual Attention Network for Video-Based Person Re-Identification [25.217641512619178]
Person ReID is a challenging problem in many analytics and surveillance applications.
Video-based person ReID has recently gained much interest because it allows capturing discriminant spatio-temporal information.
In this paper, the motion pattern of a person is explored as an additional cue for ReID.
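A minimal two-stream sketch of mutual attention: each stream produces a spatial gate for the other, so motion highlights the person while appearance refines the motion features. The 1x1-conv gating is an assumption for illustration:

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Appearance and flow streams gate each other spatially. A sketch,
    not the paper's network."""
    def __init__(self, ch=64):
        super().__init__()
        self.att_from_flow = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.att_from_app = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, app, flow):          # both (B, C, H, W)
        app_out = app * self.att_from_flow(flow)    # motion highlights the person
        flow_out = flow * self.att_from_app(app)    # appearance refines the motion
        return app_out, flow_out

ma = MutualAttention()
a, f = ma(torch.randn(2, 64, 16, 8), torch.randn(2, 64, 16, 8))
print(a.shape, f.shape)
```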
arXiv Detail & Related papers (2020-08-09T18:58:11Z)
- Temporal Complementary Learning for Video Person Re-Identification [110.43147302200101]
This paper proposes a Temporal Complementary Learning Network that extracts complementary features of consecutive video frames for video person re-identification.
A saliency erasing operation drives the specific learner to mine new and complementary parts by erasing the parts activated by previous frames.
A Temporal Saliency Boosting (TSB) module is designed to propagate the salient information among video frames to enhance the salient feature.
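The saliency-erasing operation can be pictured as zeroing the spatial positions most activated by previous frames, forcing the current learner onto complementary parts; the thresholding and normalization below are our assumptions:

```python
import torch

def saliency_erase(feat, prev_feats, thresh=0.5):
    """Erase regions already activated by previous frames. A sketch only."""
    # feat: (B, C, H, W); prev_feats: list of tensors with the same shape
    sal = torch.stack([p.mean(dim=1) for p in prev_feats]).max(dim=0).values  # (B, H, W)
    sal = (sal - sal.amin(dim=(1, 2), keepdim=True)) / (
        sal.amax(dim=(1, 2), keepdim=True) - sal.amin(dim=(1, 2), keepdim=True) + 1e-6)
    keep = (sal < thresh).unsqueeze(1).float()       # drop the most salient regions
    return feat * keep

out = saliency_erase(torch.randn(2, 64, 16, 8), [torch.randn(2, 64, 16, 8)])
print(out.shape)                                     # torch.Size([2, 64, 16, 8])
```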
arXiv Detail & Related papers (2020-07-18T07:59:01Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal correlation.
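One simple way to picture co-saliency is an elementwise minimum over per-frame activation maps, keeping only regions salient in every frame of the tracklet. A heavy simplification of CSTNet's modules:

```python
import torch

def co_saliency(frame_feats):
    """Gate a tracklet by the regions salient in all of its frames."""
    # frame_feats: (B, T, C, H, W)
    sal = torch.relu(frame_feats.mean(dim=2))    # (B, T, H, W) channel-mean saliency
    common = sal.min(dim=1).values               # (B, H, W) shared foreground
    return frame_feats * common[:, None, None]   # broadcast over frames and channels

x = torch.randn(2, 8, 64, 16, 8)
print(co_saliency(x).shape)                      # torch.Size([2, 8, 64, 16, 8])
```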
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
- Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation module (MG-RAFA).
Our framework achieves state-of-the-art performance on three benchmark datasets.
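A single-granularity sketch of reference-aided attention: a global reference (here, the mean of all frame features) queries each frame to produce aggregation weights. MG-RAFA applies this kind of mechanism at multiple granularities:

```python
import torch
import torch.nn as nn

class ReferenceAidedAggregation(nn.Module):
    """Mean-of-frames reference attends over the frames to weight them.
    Single granularity only; an illustration, not MG-RAFA itself."""
    def __init__(self, dim=1024):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, feats):                                  # feats: (B, T, dim)
        ref = self.q(feats.mean(dim=1, keepdim=True))          # (B, 1, dim) reference query
        att = torch.softmax(ref @ self.k(feats).transpose(1, 2)
                            / feats.size(-1) ** 0.5, dim=-1)   # (B, 1, T) frame weights
        return (att @ feats).squeeze(1)                        # (B, dim) clip feature

agg = ReferenceAidedAggregation()
print(agg(torch.randn(2, 8, 1024)).shape)                      # torch.Size([2, 1024])
```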
arXiv Detail & Related papers (2020-03-27T03:49:21Z)