A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion
Recognition
- URL: http://arxiv.org/abs/2211.09146v2
- Date: Mon, 12 Jun 2023 08:47:22 GMT
- Title: A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion
Recognition
- Authors: Benjia Zhou, Pichao Wang, Jun Wan, Yanyan Liang and Fan Wang
- Abstract summary: We introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp, to provide additional temporal regularization for motion recognition.
Secondly, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning.
- Score: 24.02488085447691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion recognition is a promising direction in computer vision, but
training video classification models is much harder than training image models due to
insufficient data and a large number of parameters. To get around this, some works
strive to explore multimodal cues from RGB-D data. Although improving motion
recognition to some extent, these methods still face sub-optimal situations in
the following aspects: (i) Data augmentation, i.e., the scale of the RGB-D
datasets is still limited, and few efforts have been made to explore novel data
augmentation strategies for videos; (ii) Optimization mechanism, i.e., the
tightly space-time-entangled network structure brings more challenges to
spatiotemporal information modeling; and (iii) Cross-modal knowledge fusion,
i.e., the high similarity between multimodal representations leads to
insufficient late fusion. To alleviate these drawbacks, in this paper we propose to
improve RGB-D-based motion recognition from both data and algorithm
perspectives. In more detail, firstly, we introduce a novel video data
augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp, to
provide additional temporal regularization for motion recognition. Secondly, a
Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed
UMDR, is proposed for video representation learning. Finally, a novel
cross-modal Complement Feature Catcher (CFCer) is explored to mine potential
common features across the multimodal information and use them as an auxiliary fusion
stream to improve the late fusion results. The seamless combination of these
novel designs forms a robust spatiotemporal representation and achieves better
performance than state-of-the-art methods on four public motion datasets.
Specifically, UMDR achieves unprecedented improvements of +4.5% on the Chalearn
IsoGD dataset. Our code is available at
https://github.com/zhoubenjia/MotionRGBD-PAMI.
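The abstract does not spell out how ShuffleMix mixes clips, so the snippet below is only a minimal, hypothetical sketch of a MixUp-style video augmentation with an added temporal shuffle, meant to illustrate the kind of temporal regularization described above. The function name shufflemix_like, its arguments, and the shuffle-then-blend recipe are assumptions rather than the authors' exact procedure.

```python
import torch

def shufflemix_like(clip_a, clip_b, label_a, label_b, alpha=0.5):
    """Hypothetical sketch, not the authors' exact ShuffleMix recipe:
    blend two clips MixUp-style, but first shuffle the frame order of the
    second clip so that the mixture also perturbs temporal structure.

    clip_*  : float tensors of shape (T, C, H, W)
    label_* : one-hot float tensors of shape (num_classes,)
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clip_b.shape[0])               # shuffle along time
    mixed_clip = lam * clip_a + (1.0 - lam) * clip_b[perm]
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_clip, mixed_label
```

As with MixUp, the target stays a convex combination of the two labels; the temporal shuffle of the second clip is what would distinguish such an augmentation from plain MixUp.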
Related papers
- Multi-Dimensional Refinement Graph Convolutional Network with Robust
Decouple Loss for Fine-Grained Skeleton-Based Action Recognition [19.031036881780107]
We propose a flexible attention block called Channel-Variable Spatial-Temporal Attention (CVSTA) to enhance the discriminative power of spatial-temporal joints.
Based on CVSTA, we construct a Multi-Dimensional Refinement Graph Convolutional Network (MDR-GCN), which can improve the discrimination among channel-, joint- and frame-level features.
Furthermore, we propose a Robust Decouple Loss (RDL), which significantly boosts the effect of the CVSTA and reduces the impact of noise.
arXiv Detail & Related papers (2023-06-27T09:23:36Z)
- IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via
Sequence Modeling [3.867363075280544]
Multimodal knowledge graph link prediction aims to improve the accuracy and efficiency of link prediction tasks for multimodal data.
A new model, Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling (IMKGA-SM), is developed.
The model achieves much better performance than SOTA baselines on multimodal link prediction datasets of different sizes.
arXiv Detail & Related papers (2023-01-06T10:08:11Z)
- Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based
Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multi-temporal representation.
We propose to decouple and recouple spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z)
- RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data (a hedged sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-09-15T12:31:27Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- Learning Selective Mutual Attention and Contrast for RGB-D Saliency
Detection [145.4919781325014]
How to effectively fuse cross-modal information is the key problem for RGB-D salient object detection.
Many models adopt a feature fusion strategy but are limited by low-order point-to-point fusion methods.
We propose a novel mutual attention model by fusing attention and contexts from different modalities.
arXiv Detail & Related papers (2020-10-12T08:50:10Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for
Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family (a hedged sketch of which appears after this list), and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Skeleton Focused Human Activity Recognition in RGB Video [11.521107108725188]
We propose a multimodal feature fusion model that utilizes both skeleton and RGB modalities to infer human activity.
The model could be either individually or uniformly trained by the back-propagation algorithm in an end-to-end manner.
arXiv Detail & Related papers (2020-04-29T06:40:42Z)
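For the cascaded mutual information minimization entry above, the exact estimator used by that paper is not given here, so the following is a generic, hedged sketch of one common way to penalize mutual information between RGB and depth embeddings: a CLUB-style sampled variational upper bound. The class name MIUpperBound, the Gaussian variational network, and the hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MIUpperBound(nn.Module):
    """Illustrative only (the cited paper's estimator may differ): a CLUB-style
    sampled upper bound on mutual information between RGB and depth feature
    vectors. Minimizing it encourages the two modalities to carry less
    redundant information."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        # Variational Gaussian approximation q(depth_feat | rgb_feat)
        self.mu = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.logvar = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim), nn.Tanh())

    def loglik(self, rgb_feat, depth_feat):
        # log q(depth | rgb) up to an additive constant; used to fit q itself
        mu, logvar = self.mu(rgb_feat), self.logvar(rgb_feat)
        return (-(depth_feat - mu) ** 2 / logvar.exp() - logvar).sum(dim=1).mean()

    def forward(self, rgb_feat, depth_feat):
        # Sampled CLUB bound: matched pairs minus within-batch shuffled pairs.
        mu, logvar = self.mu(rgb_feat), self.logvar(rgb_feat)
        pos = -((depth_feat - mu) ** 2) / logvar.exp()
        neg = -((depth_feat[torch.randperm(depth_feat.size(0))] - mu) ** 2) / logvar.exp()
        return (pos - neg).sum(dim=1).mean()  # add to the task loss with a small weight
```

In practice the variational network is typically fitted by maximizing loglik on matched RGB-depth pairs, while the feature encoders minimize the bound returned by forward as an auxiliary loss.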
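The NAS-based gesture recognition entry names the 3D Central Difference Convolution (3D-CDC) family. Below is a hedged sketch of the commonly used efficient formulation of central difference convolution extended to 3D, where the output is a vanilla convolution minus a theta-weighted term computed from the kernel's summed weights; the class name CDC3D and the default theta are assumptions, and no claim is made that this matches the specific 3D-CDC variants searched in that paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3D(nn.Module):
    """Sketch of a 3D central difference convolution: blends a vanilla 3D
    convolution with a central-difference term controlled by theta
    (theta = 0 recovers a plain Conv3d)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.6):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)  # standard spatiotemporal convolution
        if self.theta > 0:
            # Central-difference term: equivalent to subtracting x(p0) from every
            # sampled position, implemented as a 1x1x1 conv with the summed kernel.
            w_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
            out = out - self.theta * F.conv3d(x, w_sum)
        return out
```

Setting theta to 0 recovers a plain Conv3d, while larger values emphasize local spatiotemporal gradients over raw intensities.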