Multi-modal Fusion for Single-Stage Continuous Gesture Recognition
- URL: http://arxiv.org/abs/2011.04945v2
- Date: Tue, 24 Aug 2021 06:36:51 GMT
- Title: Multi-modal Fusion for Single-Stage Continuous Gesture Recognition
- Authors: Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes
- Abstract summary: We introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF).
TMMF can detect and classify multiple gestures in a video via a single model.
This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step.
- Score: 45.19890687786009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gesture recognition is a much-studied research area with myriad
real-world applications, including robotics and human-machine interaction.
Current gesture recognition methods have focused on recognising isolated
gestures, and existing continuous gesture recognition methods are limited to
two-stage approaches where independent models are required for detection and
classification, with the performance of the latter being constrained by
detection performance. In contrast, we introduce a single-stage continuous
gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that
can detect and classify multiple gestures in a video via a single model. This
approach learns the natural transitions between gestures and non-gestures
without the need for a pre-processing segmentation step to detect individual
gestures. To achieve this, we introduce a multi-modal fusion mechanism to
support the integration of important information that flows from multi-modal
inputs, and is scalable to any number of modes. Additionally, we propose
Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to
map uni-modal features and the fused multi-modal features respectively. To
further enhance performance, we propose a mid-point based loss function that
encourages smooth alignment between the ground truth and the prediction,
helping the model to learn natural gesture transitions. We demonstrate the
utility of our proposed framework, which can handle variable-length input
videos, and outperforms the state-of-the-art on three challenging datasets:
EgoGesture, IPN hand, and ChaLearn LAP Continuous Gesture Dataset (ConGD).
Furthermore, ablation experiments show the importance of different components
of the proposed framework.
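The abstract describes the TMMF pipeline only at a high level, so the following is a minimal PyTorch-style sketch of how per-modality Unimodal Feature Mapping (UFM) blocks, a fusion step over an arbitrary number of modalities, and a Multi-modal Feature Mapping (MFM) head could be wired together for frame-wise prediction with an explicit no-gesture class. The module names come from the abstract, but the layer choices (temporal 1D convolutions, concatenation-plus-projection fusion) and all dimensions are assumptions, not the authors' architecture.
```python
# Hypothetical sketch of a TMMF-style single-stage pipeline (not the authors' code).
# Assumptions: pre-extracted per-frame feature vectors for each modality, 1D temporal
# convolutions inside the UFM/MFM blocks, and class index 0 reserved for "no gesture".
import torch
import torch.nn as nn


class UFM(nn.Module):
    """Unimodal Feature Mapping: maps one modality's frame features over time."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):  # x: (batch, time, in_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden_dim)


class TMMFSketch(nn.Module):
    """Single-stage continuous gesture recogniser over any number of modalities."""
    def __init__(self, modality_dims, hidden_dim, num_classes):
        super().__init__()
        self.ufms = nn.ModuleList(UFM(d, hidden_dim) for d in modality_dims)
        # Fusion: concatenate the mapped modalities and project back down; the input
        # width scales with len(modality_dims), so any number of modes is supported.
        self.fuse = nn.Linear(hidden_dim * len(modality_dims), hidden_dim)
        # MFM maps the fused sequence; the classifier adds one extra "no gesture" class.
        self.mfm = UFM(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes + 1)

    def forward(self, inputs):  # list of (batch, time, dim_m) tensors, one per modality
        mapped = [ufm(x) for ufm, x in zip(self.ufms, inputs)]
        fused = self.fuse(torch.cat(mapped, dim=-1))
        return self.classifier(self.mfm(fused))  # (batch, time, num_classes + 1)


# Usage: RGB and depth frame features for a single variable-length clip.
model = TMMFSketch(modality_dims=[512, 512], hidden_dim=256, num_classes=83)
rgb, depth = torch.randn(1, 200, 512), torch.randn(1, 200, 512)
logits = model([rgb, depth])  # per-frame logits; runs of argmax > 0 form gesture segments
```
Because the model emits a prediction for every frame, detection and classification collapse into one pass: contiguous runs of a non-zero class give both the gesture boundaries and the label, which is what makes the approach single-stage.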
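The mid-point based loss is described in a single sentence, so below is one illustrative way such a loss could be formulated: frame-wise cross-entropy whose per-frame weights peak at the mid-point of each ground-truth gesture segment and taper towards its boundaries, so imprecise transition timing is penalised less than errors in the middle of a gesture. The function name and the triangular weighting are assumptions for illustration, not the loss from the paper.
```python
# Illustrative mid-point-weighted loss: an interpretation of the one-line description
# in the abstract, not the exact loss function proposed in the paper.
import torch
import torch.nn.functional as F


def midpoint_weighted_loss(logits, labels):
    """Frame-wise cross-entropy whose weights peak at each gesture's mid-point.

    logits: (time, num_classes + 1); labels: (time,) int64 with 0 = no gesture.
    Frames near a ground-truth segment's boundaries get smaller weights than frames
    near its mid-point, so imprecise transition timing is penalised more gently.
    """
    time_steps = labels.shape[0]
    weights = torch.ones(time_steps, device=labels.device)
    t = 0
    while t < time_steps:
        if labels[t] > 0:
            start = t
            while t < time_steps and labels[t] == labels[start]:
                t += 1
            end = t  # the segment covers frames [start, end)
            mid = (start + end - 1) / 2.0
            half = max((end - start) / 2.0, 1.0)
            idx = torch.arange(start, end, dtype=torch.float32, device=labels.device)
            # Triangular profile: 1.0 at the mid-point, roughly 0.5 at the boundaries.
            weights[start:end] = 1.0 - 0.5 * (idx - mid).abs() / half
        else:
            t += 1
    per_frame = F.cross_entropy(logits, labels, reduction="none")  # (time,)
    return (weights * per_frame).mean()
```
With the frame-wise logits from the sketch above, the loss for one clip would be computed as `midpoint_weighted_loss(logits[0], frame_labels)`, leaving the model free to place gesture boundaries slightly earlier or later while still being anchored to the labelled mid-points.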
Related papers
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We develop a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
- Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object Tracking [23.130490413184596]
We introduce PointNet++ to obtain multi-scale deep representations of the point cloud, making it adaptive to our proposed Interactive Feature Fusion.
Our method can achieve good performance on the KITTI benchmark and outperform other approaches without using multi-scale feature fusion.
arXiv Detail & Related papers (2022-03-30T13:00:27Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)