Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based
Motion Recognition
- URL: http://arxiv.org/abs/2112.09129v1
- Date: Thu, 16 Dec 2021 18:59:47 GMT
- Title: Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based
Motion Recognition
- Authors: Benjia Zhou and Pichao Wang and Jun Wan and Yanyan Liang and Fan Wang
and Du Zhang and Zhen Lei and Hao Li and Rong Jin
- Abstract summary: Previous motion recognition methods have achieved promising performance through the tightly coupled multi-modal spatiotemporal representation.
We propose to decouple and recouple spatiotemporal representation for RGB-D-based motion recognition.
- Score: 62.46544616232238
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decoupling spatiotemporal representation refers to decomposing the spatial
and temporal features into dimension-independent factors. Although previous
RGB-D-based motion recognition methods have achieved promising performance
through the tightly coupled multi-modal spatiotemporal representation, they
still suffer from (i) optimization difficulty under small-data settings due to
the tightly entangled spatiotemporal modeling; (ii) information redundancy, as the
representation usually contains much marginal information that is weakly relevant to
classification; and (iii) low interaction between multi-modal spatiotemporal
information caused by insufficient late fusion. To alleviate these drawbacks,
we propose to decouple and recouple spatiotemporal representation for
RGB-D-based motion recognition. Specifically, we disentangle the task of
learning spatiotemporal representation into three sub-tasks: (1) Learning
high-quality and dimension-independent features through a decoupled spatial and
temporal modeling network. (2) Recoupling the decoupled representation to
establish stronger space-time dependency. (3) Introducing a Cross-modal
Adaptive Posterior Fusion (CAPF) mechanism to capture cross-modal
spatiotemporal information from RGB-D data. The seamless combination of these novel
designs forms a robust spatiotemporal representation and achieves better
performance than state-of-the-art methods on four public motion datasets. Our
code is available at https://github.com/damo-cv/MotionRGBD.
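To make the three sub-tasks concrete, the following is a minimal PyTorch-style sketch of decoupled spatial and temporal modeling followed by recoupling, plus a stand-in for cross-modal late fusion. All module names, layer choices, and shapes are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
# Minimal sketch of decoupling (separate spatial and temporal branches) and
# recoupling (re-establishing space-time dependency). Illustrative only; names,
# layers, and shapes are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class DecoupleRecouple(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Spatial branch: per-frame convolution (kernel 1 x K x K).
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal branch: per-location convolution along time (kernel K x 1 x 1).
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Recoupling: joint spatiotemporal convolution over the concatenated branches.
        self.recouple = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x):  # x: (B, C, T, H, W)
        s = self.spatial(x)   # dimension-independent spatial features
        t = self.temporal(x)  # dimension-independent temporal features
        return self.recouple(torch.cat([s, t], dim=1))

class CrossModalLateFusion(nn.Module):
    """Stand-in for the paper's CAPF idea: fuse pooled RGB and depth features late.
    Reduced here to a learned weighted sum purely for illustration."""
    def __init__(self, channels=64, num_classes=60):
        super().__init__()
        self.rgb_head = nn.Linear(channels, num_classes)
        self.depth_head = nn.Linear(channels, num_classes)
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, rgb_feat, depth_feat):  # each: (B, C) pooled per-modality features
        return self.alpha * self.rgb_head(rgb_feat) + (1 - self.alpha) * self.depth_head(depth_feat)
```

Keeping the spatial and temporal kernels factorized reduces the number of jointly entangled parameters, which reflects the intuition behind the easier optimization claimed in point (i) of the abstract.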
Related papers
- Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition [7.682613953680041]
We propose the Surgical Transformer (Surgformer) to address the issues of spatial-temporal modeling and redundancy in an end-to-end manner.
We show that our proposed Surgformer performs favorably against the state-of-the-art methods.
arXiv Detail & Related papers (2024-08-07T16:16:31Z) - Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action
and Gesture Recognition [30.975823858419965]
We propose an innovative architecture called Multi-stage Factorized-Trans (MFST) for RGB-D action and gesture recognition.
The MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatiotemporal stages.
arXiv Detail & Related papers (2023-08-23T08:49:43Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
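As a rough illustration of the memory-bank idea, the sketch below lets the current frame's feature attend to stored historical features and then appends it to the bank; the class name, attention-based interaction, and fixed capacity are assumptions, not the paper's design.

```python
# Illustrative sketch: interact the current-frame feature with a memory bank of
# historical frame features via attention. Names, shapes, and capacity are assumptions.
import torch
import torch.nn as nn

class FrameMemoryBank(nn.Module):
    def __init__(self, dim=128, capacity=8):
        super().__init__()
        self.capacity = capacity
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.register_buffer("memory", torch.zeros(0, dim))  # grows up to `capacity` frames

    def forward(self, current):                # current: (1, dim) feature of the latest frame
        if self.memory.shape[0] == 0:
            fused = current
        else:
            q = current.unsqueeze(0)           # (1, 1, dim) query from the current frame
            kv = self.memory.unsqueeze(0)      # (1, M, dim) keys/values from history
            fused, _ = self.attn(q, kv, kv)
            fused = fused.squeeze(0)           # (1, dim)
        # Store the current frame and keep only the most recent `capacity` entries.
        self.memory = torch.cat([self.memory, current.detach()], dim=0)[-self.capacity:]
        return fused
```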
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion
Recognition [24.02488085447691]
Firstly, we introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp and provides additional temporal regularization for motion recognition.
Secondly, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning.
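The summary does not define ShuffleMix, so the sketch below shows only the standard MixUp it supplements, plus a hypothetical temporally shuffled variant; `shuffle_mix` and its behaviour here are assumptions, not the UMDR formulation.

```python
# Standard MixUp on video clips, plus a hypothetical temporally shuffled variant.
# The exact ShuffleMix formulation is not given in the summary; this is an assumption.
import torch

def mixup(clips, labels, alpha=0.2):
    """clips: (B, C, T, H, W); labels: (B, num_classes) one-hot."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed = lam * clips + (1 - lam) * clips[perm]
    targets = lam * labels + (1 - lam) * labels[perm]
    return mixed, targets

def shuffle_mix(clips, labels, alpha=0.2):
    """Hypothetical variant: shuffle the frame order of the mixed-in clip before
    blending, adding a mild temporal regularization on top of MixUp."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    t_perm = torch.randperm(clips.size(2))
    mixed = lam * clips + (1 - lam) * clips[perm][:, :, t_perm]
    targets = lam * labels + (1 - lam) * labels[perm]
    return mixed, targets
```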
arXiv Detail & Related papers (2022-11-16T19:00:23Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in
Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed as Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - Spatial Temporal Graph Attention Network for Skeleton-Based Action
Recognition [10.60209288486904]
Current methods in skeleton-based action recognition mainly consider capturing long-term temporal dependencies.
We propose a general framework, coined as STGAT, to model cross-spacetime information flow.
STGAT achieves state-of-the-art performance on three large-scale datasets.
arXiv Detail & Related papers (2022-08-18T02:34:46Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting temporal features at multiple resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
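A minimal sketch of extracting temporal features at multiple resolutions, assuming parallel temporal convolutions with different kernel sizes fused by a pointwise convolution; this is an illustration, not the paper's exact block.

```python
# Illustrative multi-temporal-resolution block: parallel temporal convolutions with
# different kernel sizes, fused by a 1x1x1 convolution. Kernel sizes and fusion are
# assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class MultiTemporalConv(nn.Module):
    def __init__(self, channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))
            for k in kernel_sizes
        ])
        self.fuse = nn.Conv3d(len(kernel_sizes) * channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, T, H, W)
        outs = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(outs, dim=1))
```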
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for
Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal and multi-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
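Central difference convolution augments a vanilla convolution with a central-difference term weighted by a hyperparameter theta; a common way to implement a 3D variant is sketched below. This is a generic CDC sketch and may differ from the authors' 3D-CDC family.

```python
# Generic sketch of a 3D central difference convolution:
# y = vanilla_conv(x) - theta * (sum of kernel weights applied to the center position).
# Follows the commonly used CDC formulation; the paper's 3D-CDC family may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0:
            return out
        # Summing the kernel over its spatio-temporal extent gives an equivalent
        # 1x1x1 kernel, which applied to x yields the "central" term of the difference.
        kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        out_center = F.conv3d(x, kernel_sum, bias=None, stride=self.conv.stride,
                              padding=0, groups=self.conv.groups)
        return out - self.theta * out_center
```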
arXiv Detail & Related papers (2020-08-21T10:45:09Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
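As a rough illustration of the disentangled multi-scale idea, the sketch below builds, for each scale k, an adjacency containing only joints at graph distance exactly k and aggregates joint features over it; the simplified aggregation and the omission of G3D's spatial-temporal windows are assumptions.

```python
# Illustrative disentangled multi-scale graph convolution over a skeleton graph:
# scale k uses a "distance-exactly-k" adjacency instead of repeated powers of A.
# Simplified sketch; G3D's spatial-temporal windows are omitted.
import torch
import torch.nn as nn

def k_adjacency(A, k):
    """Binary adjacency of joints at graph distance exactly k (identity for k = 0)."""
    A = A.float()
    I = torch.eye(A.size(0))
    if k == 0:
        return I
    reach_k = (torch.linalg.matrix_power(A + I, k) > 0).float()
    reach_km1 = (torch.linalg.matrix_power(A + I, k - 1) > 0).float()
    return reach_k - reach_km1

class MultiScaleGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, A, num_scales=4):
        super().__init__()
        # Stack of per-scale adjacencies, each keeping only distance-k neighbors.
        self.register_buffer("A_stack", torch.stack([k_adjacency(A, k) for k in range(num_scales)]))
        self.proj = nn.Linear(num_scales * in_ch, out_ch)

    def forward(self, x):  # x: (B, T, V, C) per-joint features
        outs = [torch.einsum("vu,btuc->btvc", A_k, x) for A_k in self.A_stack]
        return self.proj(torch.cat(outs, dim=-1))
```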
arXiv Detail & Related papers (2020-03-31T11:28:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.