Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action
and Gesture Recognition
- URL: http://arxiv.org/abs/2308.12006v2
- Date: Mon, 11 Sep 2023 03:01:50 GMT
- Title: Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action
and Gesture Recognition
- Authors: Yujun Ma, Benjia Zhou, Ruili Wang, Pichao Wang
- Abstract summary: We propose an innovative architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition.
The MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages.
- Score: 30.975823858419965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RGB-D action and gesture recognition remain an interesting topic in
human-centered scene understanding, primarily due to the multiple granularities
and large variation in human motion. Although many RGB-D based action and
gesture recognition approaches have demonstrated remarkable results by
utilizing highly integrated spatio-temporal representations across multiple
modalities (i.e., RGB and depth data), they still encounter several challenges.
Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion
differences between local clips under different modalities. Secondly, the
intricate nature of highly integrated spatio-temporal modeling can lead to
optimization difficulties. Thirdly, duplicate and unnecessary information can
add complexity and complicate entangled spatio-temporal modeling. To address
the above issues, we propose an innovative heuristic architecture called
Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture
recognition. The proposed MFST model comprises a 3D Central Difference
Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal
stages. The CDC-Stem enriches fine-grained temporal perception, and the
multiple hierarchical spatio-temporal stages construct dimension-independent
higher-order semantic primitives. Specifically, the CDC-Stem module captures
bottom-level spatio-temporal features and passes them successively to the
following spatio-temporal factored stages to capture the hierarchical spatial
and temporal features through the Multi-Scale Convolution and Transformer
(MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans)
block. The seamless integration of these innovative designs results in a robust
spatio-temporal representation that outperforms state-of-the-art approaches on
RGB-D action and gesture recognition datasets.
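Both the CDC-Stem here and the 3D-CDC family cited under related papers build on central difference convolution, which augments a vanilla convolution with a center-difference term weighted by a factor theta. Below is a minimal PyTorch sketch of that operator under assumed shapes; the class name, default theta, and toy configuration are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDCStem3D(nn.Module):
    """Sketch of a 3D central difference convolution (3D-CDC) stem.
    theta blends the vanilla response with a central-difference response;
    theta = 0 reduces to a plain Conv3d. Names/defaults are assumptions."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.6):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):  # x: [N, C, T, H, W]
        out = self.conv(x)
        if self.theta == 0.0:
            return out
        # The difference term sum(w) * x(p0) is equivalent to a 1x1x1 conv
        # whose kernel is the spatio-temporal sum of the learned weights.
        kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        center = F.conv3d(x, kernel_sum)
        return out - self.theta * center


# toy usage: one short RGB clip of 16 frames
clip = torch.randn(1, 3, 16, 56, 56)
stem = CDCStem3D(3, 32)
print(stem(clip).shape)  # torch.Size([1, 32, 16, 56, 56])
```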
Related papers
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted attention from the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through tightly coupled multi-modal spatiotemporal representations.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z)
- Spatio-Temporal Representation Factorization for Video-based Person Re-Identification [55.01276167336187]
We propose a Spatio-Temporal Representation Factorization (STRF) module for video-based person re-identification (re-ID).
STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID.
We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
arXiv Detail & Related papers (2021-07-25T19:29:37Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that extracts features at multiple temporal resolutions; a sketch of the idea follows.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
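The abstract does not spell out the block's internals; one common way to extract temporal features at multiple resolutions is to run parallel temporal-only convolutions at different dilation rates, as in this hedged sketch (module name and dilation choices are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class MultiTemporalBlock(nn.Module):
    """Parallel temporal-only 3D convolutions at several dilation rates,
    concatenated channel-wise, so one block sees both short- and
    long-range motion. Assumes out_ch is divisible by len(dilations)."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, branch_ch, kernel_size=(3, 1, 1),
                      padding=(d, 0, 0), dilation=(d, 1, 1), bias=False)
            for d in dilations
        ])

    def forward(self, x):  # x: [N, C, T, H, W]; T is preserved per branch
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```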
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- TEA: Temporal Excitation and Aggregation for Action Recognition [31.076707274791957]
We propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module.
For short-range motion modeling, the ME module calculates feature-level temporal differences from spatiotemporal features.
For long-range motion modeling, the MTA module deforms the local convolution into a group of sub-convolutions, forming a hierarchical residual architecture.
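As a rough illustration of the ME idea, the sketch below gates channels with adjacent-frame feature differences; it is a simplified stand-in with assumed names and shapes, not TEA's exact module (which operates on reshaped 2D feature maps):

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch of temporal-difference channel excitation: adjacent-frame
    feature differences are spatially pooled and turned into a sigmoid
    gate that re-weights the input channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.squeeze = nn.Conv3d(channels, mid, 1, bias=False)
        self.expand = nn.Conv3d(mid, channels, 1, bias=False)

    def forward(self, x):  # x: [N, C, T, H, W]
        f = self.squeeze(x)
        # forward temporal difference, zero-padded at the last time step
        diff = torch.zeros_like(f)
        diff[:, :, :-1] = f[:, :, 1:] - f[:, :, :-1]
        # pool the motion map over space and build a per-channel gate
        gate = torch.sigmoid(self.expand(diff.mean(dim=(3, 4), keepdim=True)))
        return x + x * gate  # residual excitation
```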
arXiv Detail & Related papers (2020-04-03T06:53:30Z)
- Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
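A hedged sketch of the disentangling idea: restrict each scale k to nodes at shortest-path distance exactly k, rather than using raw adjacency powers, which over-weight close neighbors. Class and helper names are illustrative, and the unified spatial-temporal (G3D) part is omitted:

```python
import torch
import torch.nn as nn

class DisentangledMultiScaleGCN(nn.Module):
    """Sketch of disentangled multi-scale graph convolution. A is assumed
    to be a binary, symmetric float adjacency matrix without self-loops;
    scale k aggregates only nodes at shortest-path distance exactly k."""
    def __init__(self, in_ch, out_ch, A, num_scales=3):
        super().__init__()
        self.register_buffer('masks', self._khop_masks(A, num_scales))
        self.proj = nn.ModuleList(
            [nn.Linear(in_ch, out_ch, bias=False) for _ in range(num_scales)])

    @staticmethod
    def _khop_masks(A, num_scales):
        n = A.size(0)
        dist = torch.full((n, n), float('inf'))
        dist.fill_diagonal_(0.0)
        reach = torch.eye(n)
        for k in range(1, num_scales):
            # reach marks nodes within distance k; newly reached nodes
            # are at shortest-path distance exactly k
            reach = ((reach @ (A + torch.eye(n))) > 0).float()
            dist[(reach > 0) & torch.isinf(dist)] = float(k)
        return torch.stack([(dist == k).float() for k in range(num_scales)])

    def forward(self, x):  # x: [N, V, C] joint features
        return sum(p(m @ x) for m, p in zip(self.masks, self.proj))
```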
arXiv Detail & Related papers (2020-03-31T11:28:25Z)