Multi-Granularity Reference-Aided Attentive Feature Aggregation for
Video-based Person Re-identification
- URL: http://arxiv.org/abs/2003.12224v1
- Date: Fri, 27 Mar 2020 03:49:21 GMT
- Title: Multi-Granularity Reference-Aided Attentive Feature Aggregation for
Video-based Person Re-identification
- Authors: Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen
- Abstract summary: Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation module (MG-RAFA).
Our framework achieves state-of-the-art performance on three benchmark datasets.
- Score: 98.7585431239291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based person re-identification (reID) aims at matching the same person
across video clips. It is a challenging task due to the existence of redundancy
among frames, newly revealed appearance, occlusion, and motion blurs. In this
paper, we propose an attentive feature aggregation module, namely
Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), to
delicately aggregate spatio-temporal features into a discriminative video-level
feature representation. In order to determine the contribution/importance of a
spatial-temporal feature node, we propose to learn the attention from a global
view with convolutional operations. Specifically, we stack its relations, i.e.,
pairwise correlations with respect to a representative set of reference feature
nodes (S-RFNs) that represents global video information, together with the
feature itself to infer the attention. Moreover, to exploit the semantics of
different levels, we propose to learn multi-granularity attentions based on the
relations captured at different granularities. Extensive ablation studies
demonstrate the effectiveness of our attentive feature aggregation module
MG-RAFA. Our framework achieves the state-of-the-art performance on three
benchmark datasets.
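To make the aggregation scheme concrete, below is a minimal PyTorch sketch of the core idea at a single granularity: each spatio-temporal feature node is scored by a small network that sees the node's feature stacked with its correlations to a set of reference nodes, and the normalized scores weight the aggregation. The reference-set construction (a strided subsample here), the layer sizes, and the module name are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAidedAttention(nn.Module):
    """Sketch of reference-aided attentive aggregation (single granularity).

    Hyper-parameters and layer choices are illustrative, not the authors'
    exact design.
    """
    def __init__(self, channels: int, num_refs: int = 32):
        super().__init__()
        self.num_refs = num_refs
        # Small MLP that maps [feature ; relations-to-references] to a score.
        self.score = nn.Sequential(
            nn.Linear(channels + num_refs, channels // 2),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 2, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C) with N = T*H*W spatio-temporal feature nodes.
        B, N, C = feats.shape
        # Build a reference set per clip; a strided subsample stands in for
        # the paper's representative set of reference feature nodes (S-RFNs).
        idx = torch.linspace(0, N - 1, self.num_refs, device=feats.device).long()
        refs = feats[:, idx, :]                                      # (B, S, C)
        # Pairwise correlations of every node with every reference node.
        rel = torch.bmm(F.normalize(feats, dim=-1),
                        F.normalize(refs, dim=-1).transpose(1, 2))   # (B, N, S)
        # Stack relations with the feature itself to infer per-node attention.
        attn = self.score(torch.cat([feats, rel], dim=-1))           # (B, N, 1)
        attn = torch.softmax(attn, dim=1)
        # Attention-weighted aggregation into one video-level vector.
        return (attn * feats).sum(dim=1)                             # (B, C)
```

In the paper's full design, such attentions would be learned at several granularities and the per-granularity aggregates combined into the final video-level representation.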
Related papers
- REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
arXiv Detail & Related papers (2023-11-27T20:48:54Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
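As a rough illustration of what a lightweight cross-modal MLP can look like, the sketch below projects a sentence embedding into the visual feature space and fuses it by channel-wise modulation; the dimensions, the fusion rule, and the class name are assumptions rather than RefSAM's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Minimal sketch of a lightweight cross-modal projection and fusion.

    All design details here are illustrative assumptions.
    """
    def __init__(self, text_dim: int = 512, vis_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
        )

    def forward(self, text_emb: torch.Tensor, vis_feat: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, text_dim) sentence embedding of the referring expression.
        # vis_feat: (B, vis_dim, H, W) feature map from the visual encoder.
        t = self.proj(text_emb)                      # (B, vis_dim)
        # Fuse by channel-wise modulation of the visual features.
        return vis_feat * t[:, :, None, None]
```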
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and identical backgrounds.
We propose a Disentanglement and Switching and Aggregation Network (DSANet), which separates identity-representing features from camera-characteristic features and directs more attention to the ID information.
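A toy sketch of the disentanglement idea: route one branch toward identity labels and another toward camera labels, so that camera-specific appearance is absorbed away from the feature used for matching. The two-head layout and supervision scheme are illustrative assumptions, not DSANet's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureDisentangler(nn.Module):
    """Toy split of a frame feature into identity-related and
    camera-related parts; an illustrative assumption, not DSANet itself.
    """
    def __init__(self, in_dim: int, num_ids: int, num_cams: int):
        super().__init__()
        self.id_head = nn.Linear(in_dim, in_dim // 2)    # identity branch
        self.cam_head = nn.Linear(in_dim, in_dim // 2)   # camera branch
        self.id_cls = nn.Linear(in_dim // 2, num_ids)    # trained with person IDs
        self.cam_cls = nn.Linear(in_dim // 2, num_cams)  # trained with camera labels

    def forward(self, feat: torch.Tensor):
        id_feat = self.id_head(feat)    # used for matching at test time
        cam_feat = self.cam_head(feat)  # absorbs camera-specific appearance
        return self.id_cls(id_feat), self.cam_cls(cam_feat), id_feat
```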
arXiv Detail & Related papers (2022-12-16T04:27:56Z)
- Dual Information Enhanced Multi-view Attributed Graph Clustering [11.624319530337038]
A novel Dual Information enhanced multi-view Attributed Graph Clustering (DIAGC) method is proposed in this paper.
The proposed method introduces the Specific Information Reconstruction (SIR) module to disentangle the exploration of consensus information from that of view-specific information across multiple views.
The Mutual Information Maximization (MIM) module maximizes the agreement between the latent high-level representation and low-level ones, and enables the high-level representation to satisfy the desired clustering structure.
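One common way to instantiate such a mutual-information objective is an InfoNCE-style contrastive bound between paired high-level and low-level representations, sketched below; the paper's exact MIM estimator may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(high: torch.Tensor, low: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style lower bound on the mutual information between paired
    high-level and low-level representations (one standard estimator; shown
    as an assumption about how the MIM module could be realized).

    high, low: (B, D), matched row-for-row; other rows act as negatives.
    """
    high = F.normalize(high, dim=-1)
    low = F.normalize(low, dim=-1)
    logits = high @ low.t() / temperature              # (B, B) similarities
    targets = torch.arange(high.size(0), device=high.device)
    # Maximizing agreement of matched pairs == minimizing this loss.
    return F.cross_entropy(logits, targets)
```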
arXiv Detail & Related papers (2022-11-28T01:18:04Z)
- Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to pursue better representational capabilities.
MGH achieves 90.0% top-1 accuracy on MARS, outperforming state-of-the-art schemes.
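For intuition, multi-granular part features in re-ID are often obtained by pooling a frame's feature map into horizontal stripes at several scales, as in the sketch below; MGH additionally connects such part nodes across frames with hypergraph edges, which this sketch does not model.

```python
import torch
import torch.nn.functional as F

def multi_granular_parts(fmap: torch.Tensor, granularities=(1, 2, 4)):
    """Pool a frame feature map into horizontal part features at several
    granularities (1 = whole body, 2 = halves, ...). A common re-ID recipe,
    used here only to illustrate what "multi-granular" nodes can be.

    fmap: (B, C, H, W) -> list of (B, C, p) part features per granularity.
    """
    parts = []
    for p in granularities:
        pooled = F.adaptive_avg_pool2d(fmap, (p, 1))   # (B, C, p, 1)
        parts.append(pooled.squeeze(-1))               # (B, C, p)
    return parts
```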
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
- Watching You: Global-guided Reciprocal Learning for Video-based Person Re-identification [82.6971648465279]
We propose a novel Global-guided Reciprocal Learning framework for video-based person Re-ID.
Our approach achieves better performance than other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-07T12:27:42Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet exploit the long-range spatial and temporal context interdependencies on such features, along with their spatial-temporal correlation.
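One plausible way to model such long-range spatio-temporal interdependency is a non-local (self-attention) block over all positions of a clip, sketched below; CSTNet's actual interaction modules may differ from this generic form.

```python
import torch
import torch.nn as nn

class NonLocalInteraction(nn.Module):
    """Generic non-local block over spatio-temporal positions, shown as an
    assumed stand-in for long-range context interaction modules.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.theta = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.phi = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.g = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.out = nn.Conv3d(channels // 2, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) spatio-temporal features of a video clip.
        B, C, T, H, W = x.shape
        q = self.theta(x).flatten(2)                   # (B, C/2, THW)
        k = self.phi(x).flatten(2)                     # (B, C/2, THW)
        v = self.g(x).flatten(2)                       # (B, C/2, THW)
        # Pairwise affinities between all spatio-temporal positions.
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (B, THW, THW)
        y = (v @ attn.transpose(1, 2)).view(B, C // 2, T, H, W)
        return x + self.out(y)                         # residual connection
```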
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
- See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks [184.4379622593225]
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task.
We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism.
We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
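A bare-bones version of co-attention between two frames is sketched below: an affinity matrix relates every location in one frame to every location in the other, and each frame then attends to the other's co-salient evidence. The learnable affinity metric and the normalization are simplified assumptions, not COSNet's exact formulation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Bare-bones co-attention between two frames, in the spirit of a
    global co-attention mechanism; details are simplified assumptions.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.eye(channels))  # learnable affinity metric

    def forward(self, fa: torch.Tensor, fb: torch.Tensor):
        # fa, fb: (B, C, H, W) features of two frames from the same video.
        B, C, H, W = fa.shape
        xa = fa.flatten(2)                               # (B, C, HW)
        xb = fb.flatten(2)                               # (B, C, HW)
        # Affinity of every location in frame A with every location in frame B.
        aff = torch.einsum('bci,cd,bdj->bij', xa, self.weight, xb)  # (B, HW, HW)
        # Each frame attends to the co-salient evidence in the other one.
        fa_att = torch.bmm(xb, aff.softmax(dim=2).transpose(1, 2)).view(B, C, H, W)
        fb_att = torch.bmm(xa, aff.softmax(dim=1)).view(B, C, H, W)
        return fa_att, fb_att
```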
arXiv Detail & Related papers (2020-01-19T11:10:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.