HAMLET: A Hierarchical Multimodal Attention-based Human Activity
Recognition Algorithm
- URL: http://arxiv.org/abs/2008.01148v1
- Date: Mon, 3 Aug 2020 19:34:48 GMT
- Title: HAMLET: A Hierarchical Multimodal Attention-based Human Activity
Recognition Algorithm
- Authors: Md Mofijul Islam and Tariq Iqbal
- Abstract summary: Human activity recognition (HAR) is a challenging task for robots due to difficulties related to multimodal data fusion.
In this work, we introduce a neural network-based multimodal algorithm, HAMLET.
We develop a novel multimodal attention mechanism for disentangling and fusing the salient unimodal features to compute the multimodal features in the upper layer.
- Score: 5.276937617129594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To fluently collaborate with people, robots need the ability to recognize
human activities accurately. Although modern robots are equipped with various
sensors, robust human activity recognition (HAR) still remains a challenging
task for robots due to difficulties related to multimodal data fusion. To
address these challenges, in this work, we introduce a deep neural
network-based multimodal HAR algorithm, HAMLET. HAMLET incorporates a
hierarchical architecture, where the lower layer encodes spatio-temporal
features from unimodal data by adopting a multi-head self-attention mechanism.
We develop a novel multimodal attention mechanism for disentangling and fusing
the salient unimodal features to compute the multimodal features in the upper
layer. Finally, the multimodal features are used in a fully connected neural network
to recognize human activities. We evaluated our algorithm by comparing its
performance to several state-of-the-art activity recognition algorithms on
three human activity datasets. The results suggest that HAMLET outperformed all
other evaluated baselines across all datasets and metrics tested, with the
highest top-1 accuracy of 95.12% and 97.45% on the UTD-MHAD [1] and the
UT-Kinect [2] datasets, respectively, and an F1-score of 81.52% on the UCSD-MIT [3]
dataset. We further visualize the unimodal and multimodal attention maps, which
provide us with a tool to interpret the impact of attention mechanisms
concerning HAR.
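The hierarchy described in the abstract maps naturally onto a small amount of code. The PyTorch snippet below is a minimal illustrative sketch of such a design: a lower layer that encodes each modality with multi-head self-attention, an upper layer that computes attention weights to fuse the salient unimodal features, and a fully connected classifier on top. All class names, dimensions, and the softmax-based fusion here are assumptions for exposition, not the authors' released implementation, whose multimodal attention mechanism may differ in detail.

```python
# Illustrative sketch of a HAMLET-style hierarchical multimodal network.
# Names, dimensions, and the softmax fusion are assumptions for
# exposition, not the authors' implementation.
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Lower layer: multi-head self-attention over one modality's
    time steps, pooled into a single spatio-temporal feature vector."""
    def __init__(self, feat_dim, embed_dim=128, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)          # self-attention across time
        return h.mean(dim=1)               # (batch, embed_dim)

class HamletStyleNet(nn.Module):
    """Upper layer: attention-weighted fusion of the unimodal features,
    followed by a fully connected classifier."""
    def __init__(self, feat_dims, num_classes, embed_dim=128):
        super().__init__()
        self.encoders = nn.ModuleList(
            [UnimodalEncoder(d, embed_dim) for d in feat_dims])
        self.salience = nn.Linear(embed_dim, 1)   # scores each modality
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes))

    def forward(self, modalities):   # list of (batch, time, feat_dim_m)
        feats = torch.stack(
            [enc(x) for enc, x in zip(self.encoders, modalities)],
            dim=1)                                # (batch, n_mods, embed_dim)
        w = torch.softmax(self.salience(feats), dim=1)  # (batch, n_mods, 1)
        fused = (w * feats).sum(dim=1)            # (batch, embed_dim)
        return self.classifier(fused)

# Toy usage with two hypothetical modalities, e.g. skeleton and
# inertial streams (all dimensions are made up for the example):
model = HamletStyleNet(feat_dims=[60, 6], num_classes=27)
skeleton = torch.randn(8, 30, 60)   # 8 clips, 30 frames, 60 joint coords
inertial = torch.randn(8, 30, 6)    # accelerometer + gyroscope channels
logits = model([skeleton, inertial])              # (8, 27)
```

Mean-pooling over time and scoring each modality with a single linear layer keep the sketch short; the paper's mechanism for disentangling and fusing salient unimodal features is more elaborate.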
Related papers
- Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network [2.223052975765005]
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction.
The system represents the 2D or 3D spatial relations of humans and objects, obtained from detection results in video data, as a graph.
We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
arXiv Detail & Related papers (2024-10-10T13:39:17Z)
- Unified Framework with Consistency across Modalities for Human Activity Recognition [14.639249548669756]
We propose a comprehensive framework for robust video-based human activity recognition.
Its key contribution is the introduction of a novel query machine called COMPUTER.
Our approach demonstrates superior performance when compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-09-04T02:25:10Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- UMSNet: An Universal Multi-sensor Network for Human Activity Recognition [10.952666953066542]
This paper proposes a universal multi-sensor network (UMSNet) for human activity recognition.
In particular, we propose a new lightweight sensor residual block (the LSR block), which improves performance.
Our framework has a clear structure and can be directly applied to various types of multi-modal Time Series Classification tasks.
arXiv Detail & Related papers (2022-05-24T03:29:54Z)
- A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z)
- Addressing Data Scarcity in Multimodal User State Recognition by Combining Semi-Supervised and Supervised Learning [1.1688030627514532]
We present a multimodal machine learning approach for detecting dis-/agreement and confusion states in a human-robot interaction environment.
We achieve an average F1-score of 81.1% for dis-/agreement detection with a small amount of labeled data and a large unlabeled data set.
arXiv Detail & Related papers (2022-02-08T10:41:41Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- What Matters in Learning from Offline Human Demonstrations for Robot Manipulation [64.43440450794495]
We conduct an extensive study of six offline learning algorithms for robot manipulation.
Our study analyzes the most critical challenges when learning from offline human data.
We highlight opportunities for learning from human datasets.
arXiv Detail & Related papers (2021-08-06T20:48:30Z)
- PALMAR: Towards Adaptive Multi-inhabitant Activity Recognition in Point-Cloud Technology [0.0]
We develop PALMAR, a multi-inhabitant activity recognition system, by employing efficient signal processing and novel machine learning techniques.
We experimentally evaluate our framework and systems using (i) real-time point-cloud data (PCD) collected by three devices (3D LiDAR and 79 GHz mmWave) from 6 participants, (ii) a publicly available 3D LiDAR activity dataset (28 participants), and (iii) an embedded hardware prototype system.
arXiv Detail & Related papers (2021-06-22T16:17:50Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)