Reference-Aided Part-Aligned Feature Disentangling for Video Person
Re-Identification
- URL: http://arxiv.org/abs/2103.11319v1
- Date: Sun, 21 Mar 2021 06:53:57 GMT
- Title: Reference-Aided Part-Aligned Feature Disentangling for Video Person
Re-Identification
- Authors: Guoqing Zhang, Yuhao Chen, Yang Dai, Yuhui Zheng, Yi Wu
- Abstract summary: We propose a Reference-Aided Part-Aligned (RAPA) framework to disentangle robust features of different parts.
By using both modules, the informative parts of pedestrians in videos are well aligned and a more discriminative feature representation is generated.
- Score: 18.13546384207381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video-based person re-identification (re-ID) has drawn increasing
attention in the computer vision community because of its practical application
prospects. Due to the inaccurate person detections and pose changes, pedestrian
misalignment significantly increases the difficulty of feature extraction and
matching. To address this problem, in this paper, we propose a
\textbf{R}eference-\textbf{A}ided \textbf{P}art-\textbf{A}ligned
(\textbf{RAPA}) framework to disentangle robust features of different parts.
Firstly, in order to obtain better references between different videos, a
pose-based reference feature learning module is introduced. Secondly, an
effective relation-based part feature disentangling module is explored to align
frames within each video. By using both modules, the informative parts of
pedestrians in videos are well aligned and a more discriminative feature
representation is generated. Comprehensive experiments on three widely-used
benchmarks, i.e., the iLIDS-VID, PRID-2011 and MARS datasets, verify the
effectiveness of the proposed framework. Our code will be made publicly
available.
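The abstract describes the two modules only at a high level. As a rough, hypothetical illustration of relation-based part alignment, the PyTorch sketch below pools each frame's spatial features into part features according to their similarity to a set of reference part vectors; all names, shapes, and the pooling scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def part_aligned_features(frame_maps: torch.Tensor,
                          ref_parts: torch.Tensor) -> torch.Tensor:
    """Align per-frame features to reference part features via relation scores.

    frame_maps: (T, C, H, W) backbone feature maps of T frames.
    ref_parts:  (P, C) one reference vector per body part (e.g., pooled
                from pose-guided regions, as in the pose-based module).
    Returns:    (P, C) video-level part features aggregated over frames.
    """
    locs = frame_maps.flatten(2).transpose(1, 2)        # (T, H*W, C)
    sim = torch.einsum('tnc,pc->tpn',
                       F.normalize(locs, dim=-1),
                       F.normalize(ref_parts, dim=-1))  # relation scores
    attn = sim.softmax(dim=-1)                          # where each part lies
    parts = torch.einsum('tpn,tnc->tpc', attn, locs)    # per-frame part pooling
    return parts.mean(dim=0)                            # average over frames

# Toy usage: 8 frames, 2048-dim maps of size 16x8, 4 parts.
out = part_aligned_features(torch.randn(8, 2048, 16, 8), torch.randn(4, 2048))
print(out.shape)  # torch.Size([4, 2048])
```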
Related papers
- OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly.
CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation.
This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
arXiv Detail & Related papers (2025-07-08T03:27:46Z)
- Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos [24.03405963900272]
Video Camouflaged Object Detection aims to segment objects whose appearances closely resemble their surroundings.
Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects.
We propose an end-to-end framework inspired by human memory-recognition.
arXiv Detail & Related papers (2025-03-21T11:08:14Z)
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
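The entry above notes that recent text-video retrieval methods build on contrastive learning but does not detail the paper's two techniques. As general background only, here is a minimal sketch of the symmetric InfoNCE objective such methods typically start from; the paper's triplet partial margin formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(video_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D); pair i is the only positive for row i,
    and all other batch entries act as negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                    # (B, B) similarities
    labels = torch.arange(v.size(0), device=v.device)   # matches on the diagonal
    # Rank the matched text above all others, and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```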
- Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization [51.206398602941405]
We propose to disentangle an original high-dimensional feature into multiple sub-features.
On top of the disentangled sub-features, we learn an auxiliary feature to enhance the sub-features.
Our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset.
arXiv Detail & Related papers (2023-09-13T10:53:12Z)
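The infringement-detection entry describes splitting one high-dimensional feature into multiple sub-features plus an auxiliary feature. A hypothetical sketch of the splitting step follows; since the summary does not give the paper's mutual-information loss, a simple decorrelation penalty stands in for it here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubFeatureDisentangler(nn.Module):
    """Project one high-dimensional feature into K separate sub-features."""
    def __init__(self, dim: int = 2048, k: int = 4, sub_dim: int = 256):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, sub_dim) for _ in range(k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, dim)
        return torch.stack([h(x) for h in self.heads], dim=1)  # (B, K, sub_dim)

def decorrelation_penalty(subs: torch.Tensor) -> torch.Tensor:
    """Push different sub-features of the same sample apart (MI stand-in)."""
    s = F.normalize(subs, dim=-1)                       # (B, K, D)
    gram = s @ s.transpose(1, 2)                        # (B, K, K) cosine sims
    off = gram - torch.eye(s.size(1), device=s.device)  # zero out the diagonal
    return off.pow(2).mean()
```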
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Support-Set Based Cross-Supervision for Video Grounding [98.29089558426399]
The Support-Set Based Cross-Supervision (Sscs) module can improve existing methods during the training phase without extra inference cost.
The proposed Sscs module contains two main components, i.e., a discriminative contrastive objective and a generative caption objective.
We extensively evaluate Sscs on three challenging datasets, and show that our method can improve current state-of-the-art methods by large margins.
arXiv Detail & Related papers (2021-08-24T08:25:26Z)
- Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to achieve better representational capabilities.
90.0% top-1 accuracy on MARS is achieved using MGH, outperforming state-of-the-art schemes.
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
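MGH's exact formulation is not given in the summary above. As orientation, the sketch below implements the standard hypergraph convolution X' = σ(D_v^{-1/2} H W D_e^{-1} Hᵀ D_v^{-1/2} X Θ) that frameworks of this kind commonly build on; the incidence matrix and weights are assumptions for illustration.

```python
import torch

def hypergraph_conv(X: torch.Tensor, H: torch.Tensor,
                    w_e: torch.Tensor, Theta: torch.Tensor) -> torch.Tensor:
    """One standard hypergraph convolution layer.

    X:     (N, C)  node features (e.g., part features across frames)
    H:     (N, E)  incidence matrix: H[n, e] = 1 if node n is in hyperedge e
    w_e:   (E,)    hyperedge weights
    Theta: (C, C') learnable projection
    """
    dv = (H * w_e).sum(dim=1).clamp_min(1e-8).pow(-0.5)  # D_v^{-1/2}
    de = H.sum(dim=0).clamp_min(1e-8).reciprocal()       # D_e^{-1}
    # Normalized propagation: diag(dv) H diag(w_e) diag(de) H^T diag(dv)
    A = (dv.unsqueeze(1) * H * (w_e * de)) @ (H.t() * dv.unsqueeze(0))
    return torch.relu(A @ X @ Theta)
```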
- FOCAL: A Forgery Localization Framework based on Video Coding Self-Consistency [26.834506269499094]
This paper presents a video forgery localization framework that verifies the self-consistency of coding traces between and within video frames.
The overall framework was validated in two typical forgery scenarios: temporal and spatial splicing.
Experimental results show an improvement over the state-of-the-art on temporal splicing localization and also promising performance in the newly tackled case of spatial splicing.
arXiv Detail & Related papers (2020-08-24T13:55:14Z)
- ESA-ReID: Entropy-Based Semantic Feature Alignment for Person re-ID [7.978877859859102]
Person re-identification (re-ID) is a challenging task in the real world. Besides its typical application in surveillance systems, re-ID also has significant value for improving the recall rate of people identification in content video (TV or movies).
In this paper we propose an entropy-based semantic feature alignment model, which takes advantage of the detailed information in human semantic features.
Considering the uncertainty of semantic segmentation, we introduce a semantic alignment with an entropy-based mask which can reduce the negative effects of mask segmentation errors.
arXiv Detail & Related papers (2020-07-09T08:56:28Z)
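The entropy-based mask in the entry above lends itself to a compact sketch: per-pixel entropy of the part-segmentation softmax becomes a confidence weight that damps uncertain regions. Shapes and the exact weighting are assumptions; the paper's formulation may differ.

```python
import math
import torch

def entropy_weighted_masks(seg_logits: torch.Tensor) -> torch.Tensor:
    """Scale soft part masks by per-pixel segmentation confidence.

    seg_logits: (K, H, W) logits over K semantic parts for one image.
    Returns:    (K, H, W) masks damped where segmentation is uncertain.
    """
    probs = seg_logits.softmax(dim=0)                         # (K, H, W)
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=0)   # (H, W) entropy
    confidence = 1.0 - ent / math.log(seg_logits.size(0))     # 1 = certain
    return probs * confidence                                 # weighted masks
```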
- Exploiting Visual Semantic Reasoning for Video-Text Retrieval [14.466809435818984]
We propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions.
We perform reasoning with a novel random-walk rule-based graph convolutional network to generate region features that incorporate semantic relations.
With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed.
arXiv Detail & Related papers (2020-06-16T02:56:46Z)
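The random-walk rule in the ViSERN entry suggests a row-stochastic propagation matrix. A minimal sketch under that reading, not the paper's published layer:

```python
import torch

def random_walk_gcn_layer(X: torch.Tensor, A: torch.Tensor,
                          W: torch.Tensor) -> torch.Tensor:
    """Graph convolution with a random-walk normalized adjacency.

    X: (N, C) region features; A: (N, N) non-negative region affinities;
    W: (C, C') learnable projection. Each region aggregates its neighbors
    in proportion to the random-walk transition probabilities D^{-1} A.
    """
    A = A + torch.eye(A.size(0), device=A.device)   # add self-loops
    P = A / A.sum(dim=1, keepdim=True)              # row-stochastic transitions
    return torch.relu(P @ X @ W)
```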
- Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-Aided Attentive Feature Aggregation (MG-RAFA) module.
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)
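To illustrate reference-aided attentive aggregation in its simplest form, the sketch below weights frames by their similarity to a global reference feature before pooling; MG-RAFA operates at multiple granularities, which this toy version omits, so treat it as an assumption-laden analogy rather than the paper's module.

```python
import torch
import torch.nn.functional as F

def reference_attentive_pool(frame_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate frame features with attention relative to a reference.

    frame_feats: (T, C). The temporal mean serves as the reference; frames
    more consistent with it receive larger aggregation weights.
    """
    ref = frame_feats.mean(dim=0, keepdim=True)                # (1, C)
    scores = F.cosine_similarity(frame_feats, ref, dim=-1)     # (T,)
    weights = scores.softmax(dim=0)                            # attention over time
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)    # (C,)
```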
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.