Spatio-Temporal Representation Factorization for Video-based Person
Re-Identification
- URL: http://arxiv.org/abs/2107.11878v1
- Date: Sun, 25 Jul 2021 19:29:37 GMT
- Title: Spatio-Temporal Representation Factorization for Video-based Person
Re-Identification
- Authors: Abhishek Aich, Meng Zheng, Srikrishna Karanam, Terrence Chen, Amit K.
Roy-Chowdhury, Ziyan Wu
- Abstract summary: We propose the Spatio-Temporal Representation Factorization (STRF) module for re-ID.
STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID.
We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
- Score: 55.01276167336187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite much recent progress in video-based person re-identification (re-ID),
the current state-of-the-art still suffers from common real-world challenges
such as appearance similarity among various people, occlusions, and frame
misalignment. To alleviate these problems, we propose the Spatio-Temporal
Representation Factorization (STRF) module, a flexible new computational unit
that can be used in conjunction with most existing 3D convolutional neural
network architectures for re-ID. The key innovations of STRF over prior work
include explicit pathways for learning discriminative temporal and spatial
features, with each component further factorized to capture complementary
person-specific appearance and motion information. Specifically, temporal
factorization comprises two branches, one each for static features (e.g., the
color of clothes) that do not change much over time, and dynamic features
(e.g., walking patterns) that change over time. Further, spatial factorization
also comprises two branches to learn both global (coarse segments) as well as
local (finer segments) appearance features, with the local features
particularly useful in cases of occlusion or spatial misalignment. These two
factorization operations taken together result in a modular architecture for
our parameter-efficient STRF unit, which can be plugged in between any two 3D
convolutional layers, resulting in an end-to-end learning framework. We
empirically show that STRF improves performance of various existing baseline
architectures while demonstrating new state-of-the-art results using standard
person re-identification evaluation protocols on three benchmarks.
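As a rough illustration of the temporal factorization idea (a static branch for slowly changing features and a dynamic branch for motion), the following is a minimal sketch in plain Python. The function name and the simple mean/difference operations are illustrative assumptions only, not the paper's actual 3D-convolutional STRF implementation:

```python
# Hypothetical sketch (not the authors' code): split per-frame feature
# vectors into a "static" component (temporal average, e.g. clothing color)
# and a "dynamic" component (frame-to-frame differences, e.g. gait),
# in the spirit of the temporal factorization described above.
from statistics import mean

def temporal_factorize(frames):
    """frames: list of per-frame feature vectors (lists of floats).
    Returns (static, dynamic): the temporal mean per dimension, and the
    first-order temporal differences between consecutive frames."""
    num_dims = len(frames[0])
    # Static branch: features that change little over time -> temporal mean.
    static = [mean(f[d] for f in frames) for d in range(num_dims)]
    # Dynamic branch: features that change over time -> first-order differences.
    dynamic = [
        [curr[d] - prev[d] for d in range(num_dims)]
        for prev, curr in zip(frames, frames[1:])
    ]
    return static, dynamic

# Toy example: 3 frames, 2-dimensional features.
frames = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
static, dynamic = temporal_factorize(frames)
# static  -> [1.0, 1.0]  (dimension 0 never changes; dimension 1 averages to 1.0)
# dynamic -> [[0.0, 1.0], [0.0, 1.0]]  (dimension 1 grows by +1.0 each frame)
```

In the actual module, both branches would be learned 3D-convolutional pathways and their outputs fused before the next layer; this sketch only shows why separating the two signals exposes complementary appearance and motion cues.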
Related papers
- Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space.
We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance, and previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- Hierarchical Spatio-Temporal Representation Learning for Gait Recognition [6.877671230651998]
Gait recognition is a biometric technique that identifies individuals by their unique walking styles.
We propose a hierarchical spatio-temporal representation learning framework for extracting gait features from coarse to fine.
Our method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity.
arXiv Detail & Related papers (2023-07-19T09:30:00Z)
- Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and same backgrounds.
We propose a Disentanglement, Switching and Aggregation Network (DSANet), which separates identity-related features from features driven by camera characteristics, and pays more attention to ID information.
arXiv Detail & Related papers (2022-12-16T04:27:56Z)
- A persistent homology-based topological loss for CNN-based multi-class segmentation of CMR [5.898114915426535]
Multi-class segmentation of cardiac magnetic resonance (CMR) images seeks a separation of data into anatomical components with known structure and configuration.
Most popular CNN-based methods are optimised using pixel-wise loss functions, which ignore the spatially extended features that characterise anatomy.
We extend these approaches to the task of multi-class segmentation by building an enriched topological description of all class labels and class label pairs.
arXiv Detail & Related papers (2021-07-27T09:21:38Z)
- 3D-ANAS: 3D Asymmetric Neural Architecture Search for Fast Hyperspectral Image Classification [5.727964191623458]
Hyperspectral images involve abundant spectral and spatial information, playing an irreplaceable role in land-cover classification.
Recently, based on deep learning technologies, an increasing number of HSI classification approaches have been proposed, which demonstrate promising performance.
Previous studies suffer from two major drawbacks: 1) the architectures of most deep learning models are manually designed, which relies on specialized knowledge and is relatively tedious.
arXiv Detail & Related papers (2021-01-12T04:15:40Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.