Not 3D Re-ID: a Simple Single Stream 2D Convolution for Robust Video
Re-identification
- URL: http://arxiv.org/abs/2008.06318v2
- Date: Mon, 17 Aug 2020 10:49:24 GMT
- Title: Not 3D Re-ID: a Simple Single Stream 2D Convolution for Robust Video
Re-identification
- Authors: Toby P. Breckon and Aishah Alsehaim
- Abstract summary: Video-based Re-ID is an expansion of earlier image-based re-identification methods.
We show superior performance from a simple single stream 2D convolution network leveraging the ResNet50-IBN architecture.
Our approach uses best video Re-ID practice and transfer learning between datasets to outperform existing state-of-the-art approaches.
- Score: 14.785070524184649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based person re-identification has received increasing attention
recently, as it plays an important role within surveillance video analysis.
Video-based Re-ID is an expansion of earlier image-based re-identification
methods by learning features from a video via multiple image frames for each
person. Most contemporary video Re-ID methods utilise complex CNN-based network
architectures using 3D convolution or multi-branch networks to extract
spatial-temporal video features. By contrast, in this paper, we illustrate
superior performance from a simple single stream 2D convolution network
leveraging the ResNet50-IBN architecture to extract frame-level features
followed by temporal attention for clip level features. These clip level
features can be generalised to extract video level features by averaging
without any significant additional cost. Our approach uses best video Re-ID
practice and transfer learning between datasets to outperform existing
state-of-the-art approaches on the MARS, PRID2011 and iLIDS-VID datasets with
89.62%, 97.75% and 97.33% rank-1 accuracy respectively, and with 84.61% mAP for
MARS, without reliance on complex and memory-intensive 3D convolutions or
multi-stream network architectures as found in other contemporary work.
Moreover, our work shows that global features extracted by the 2D convolution
network are a sufficient representation for robust state-of-the-art video
Re-ID.
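To make the pipeline concrete, here is a minimal PyTorch sketch of the approach as
described in the abstract. It is a hypothetical reconstruction, not the authors'
code: torchvision's plain ResNet-50 stands in for the ResNet50-IBN backbone, and
the temporal attention is a generic learned frame weighting rather than the
paper's exact formulation.

```python
# Minimal sketch of the single-stream 2D-conv video Re-ID pipeline.
# Hypothetical illustration: a plain torchvision ResNet-50 stands in for
# ResNet50-IBN, and the attention head is a generic learned frame weighting.
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalAttentionReID(nn.Module):
    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Drop the classifier; keep the 2D convolutional feature extractor.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        # One score per frame; softmax over time yields attention weights.
        self.attn = nn.Linear(feat_dim, 1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, C, H, W) -- a batch of B clips of T frames each.
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.view(b * t, c, h, w)).flatten(1)
        feats = feats.view(b, t, -1)                      # (B, T, feat_dim)
        weights = torch.softmax(self.attn(feats), dim=1)  # (B, T, 1)
        return (weights * feats).sum(dim=1)               # clip-level feature


def video_feature(model: TemporalAttentionReID,
                  clips: torch.Tensor) -> torch.Tensor:
    # clips: (N, T, C, H, W) -- N clips sampled from one tracklet.
    # Video-level feature = mean of clip-level features, at negligible cost.
    with torch.no_grad():
        return model(clips).mean(dim=0)
```

Matching then reduces to nearest-neighbour ranking of gallery tracklets against a
query under a cosine or Euclidean distance between these video-level descriptors.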
Related papers
- MV2MAE: Multi-View Video Masked Autoencoders [33.61642891911761]
We present a method for self-supervised learning from synchronized multi-view videos.
We use a cross-view reconstruction task to inject geometry information in the model.
Our approach is based on the masked autoencoder (MAE) framework.
arXiv Detail & Related papers (2024-01-29T05:58:23Z)
- Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization [51.206398602941405]
We propose to disentangle an original high-dimensional feature into multiple sub-features.
On top of the disentangled sub-features, we learn an auxiliary feature to enhance the sub-features.
Our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset.
arXiv Detail & Related papers (2023-09-13T10:53:12Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and shared backgrounds.
We propose a Disentanglement and Switching and Aggregation Network (DSANet), which segregates features representing identity from features reflecting camera characteristics, and pays more attention to ID information.
arXiv Detail & Related papers (2022-12-16T04:27:56Z)
- Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification [46.49866514866999]
We primarily study the video-based cross-modal person Re-ID method.
We show that performance improves as the number of frames in a tracklet increases.
A novel method is proposed, which projects two modalities to a modal-invariant subspace.
arXiv Detail & Related papers (2022-08-04T04:43:52Z)
- Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses a video sequence to a single informative "frame".
A valid question is how to define "useful information" and then distill it from a sequence down to one synthetic frame.
The proposed Informative Frame Synthesis (IFS) consistently demonstrates clear improvements on both image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)
- Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based Re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z)
- Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms the existing state of the art by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)
- A Flow-Guided Mutual Attention Network for Video-Based Person Re-Identification [25.217641512619178]
Person ReID is a challenging problem in many analytics and surveillance applications.
Video-based person ReID has recently gained much interest because it allows capturing discriminant spatio-temporal information.
In this paper, the motion pattern of a person is explored as an additional cue for ReID.
arXiv Detail & Related papers (2020-08-09T18:58:11Z)
- Temporal Distinct Representation Learning for Action Recognition [139.93983070642412]
A Two-Dimensional Convolutional Neural Network (2D CNN) is commonly used to characterize videos.
Different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization.
We propose a sequential channel filtering mechanism to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction.
Our method is evaluated on the benchmark temporal reasoning datasets Something-Something V1 and V2, where it improves over the best competitor by 2.4% and 1.3%, respectively.
arXiv Detail & Related papers (2020-07-15T11:30:40Z)
- Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [51.110453988705395]
Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL); see the triplet-loss sketch after this list.
To achieve a complete model of video-based person Re-ID, a multi-task framework with an Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
arXiv Detail & Related papers (2020-06-13T09:15:38Z)
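As context for the metric-learning entries above, and for AITL in particular,
which extends the triplet-loss family, the following is a minimal sketch of the
standard batch-hard triplet loss widely used to train Re-ID embeddings. It is
the generic baseline formulation (Hermans et al.), not AITL itself.

```python
# Generic batch-hard triplet loss for Re-ID embeddings (baseline sketch,
# not the attribute-aware AITL variant described above).
import torch


def batch_hard_triplet_loss(feats: torch.Tensor,
                            labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    # feats: (N, D) embeddings; labels: (N,) integer identity labels.
    dist = torch.cdist(feats, feats)                  # pairwise distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    # Hardest positive: the farthest sample sharing the anchor's identity.
    hardest_pos = (dist * same_id.float()).max(dim=1).values
    # Hardest negative: the closest sample with a different identity.
    masked = torch.where(same_id, torch.full_like(dist, float('inf')), dist)
    hardest_neg = masked.min(dim=1).values
    # Hinge: pull positives within `margin` of the nearest negative.
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```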