Skeleton Focused Human Activity Recognition in RGB Video
- URL: http://arxiv.org/abs/2004.13979v1
- Date: Wed, 29 Apr 2020 06:40:42 GMT
- Title: Skeleton Focused Human Activity Recognition in RGB Video
- Authors: Bruce X. B. Yu, Yan Liu, Keith C. C. Chan
- Abstract summary: We propose a multimodal feature fusion model that utilizes both skeleton and RGB modalities to infer human activity.
The model can be trained either individually or uniformly by the back-propagation algorithm in an end-to-end manner.
- Score: 11.521107108725188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The data-driven approach that learns an optimal representation of vision
features such as skeleton frames or RGB videos is currently the dominant paradigm
for activity recognition. While great improvements have been achieved by existing
single-modal approaches with increasingly larger datasets, the fusion of different
data modalities at the feature level has seldom been attempted. In this paper, we
propose a multimodal feature fusion model that utilizes both skeleton and RGB
modalities to infer human activity. The objective is to improve activity recognition
accuracy by effectively exploiting the mutually complementary information among the
data modalities. For the skeleton modality, we propose a graph convolutional
subnetwork to learn the skeleton representation. For the RGB modality, we use the
spatial-temporal region of interest from RGB videos and take attention features from
the skeleton modality to guide the learning process. The model can be trained either
individually or uniformly with the back-propagation algorithm in an end-to-end
manner. Experimental results on the NTU-RGB+D and Northwestern-UCLA Multiview
datasets show state-of-the-art performance, indicating that the proposed
skeleton-driven attention mechanism for the RGB modality strengthens the mutual
communication between the data modalities and yields more discriminative features
for inferring human activities.
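The abstract outlines the architecture at a high level: a graph convolutional subnetwork encodes the skeleton stream, and its features both feed the classifier and gate the RGB stream through a skeleton-driven attention mechanism before feature-level fusion. Below is a minimal PyTorch sketch of that idea; the paper does not publish this code, and the layer sizes, the MLP stand-ins for the GCN and RGB backbones, the sigmoid gating, and concatenation fusion are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of skeleton-guided multimodal fusion (assumed design).
import torch
import torch.nn as nn


class SkeletonGuidedFusion(nn.Module):
    def __init__(self, skel_dim=256, rgb_dim=512, num_classes=60):
        super().__init__()
        # Stand-in for the graph convolutional subnetwork: any encoder
        # mapping a flattened skeleton clip to a feature vector works here.
        self.skel_encoder = nn.Sequential(
            nn.Linear(75, skel_dim), nn.ReLU(),
            nn.Linear(skel_dim, skel_dim),
        )
        # Stand-in for the RGB subnetwork over spatio-temporal ROI features.
        self.rgb_encoder = nn.Sequential(
            nn.Linear(2048, rgb_dim), nn.ReLU(),
        )
        # Attention weights for the RGB features, conditioned on the
        # skeleton representation (the "skeleton-driven attention").
        self.attn = nn.Sequential(
            nn.Linear(skel_dim, rgb_dim), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(skel_dim + rgb_dim, num_classes)

    def forward(self, skel, rgb):
        # skel: (batch, 75) flattened joint coordinates per clip
        # rgb:  (batch, 2048) pooled ROI features per clip
        f_skel = self.skel_encoder(skel)
        f_rgb = self.rgb_encoder(rgb)
        # Reweight RGB features with skeleton-derived attention.
        f_rgb = f_rgb * self.attn(f_skel)
        # Feature-level fusion by concatenation, then classification.
        return self.classifier(torch.cat([f_skel, f_rgb], dim=-1))


model = SkeletonGuidedFusion()
logits = model(torch.randn(4, 75), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 60])
```

Because every operation in the sketch is differentiable, either subnetwork can be trained on its own loss or the whole model trained uniformly end-to-end, matching the training scheme the abstract describes.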
Related papers
- Language Supervised Human Action Recognition with Salient Fusion: Construction Worker Action Recognition as a Use Case [8.26451988845854]
We introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues.
We employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation.
We introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities.
arXiv Detail & Related papers (2024-10-02T19:10:23Z) - Adversarial Robustness in RGB-Skeleton Action Recognition: Leveraging Attention Modality Reweighter [32.64004722423187]
We show how to improve the robustness of RGB-skeleton action recognition models.
We propose the Attention-based Modality Reweighter (AMR).
Our AMR is plug-and-play, allowing easy integration with multimodal models.
arXiv Detail & Related papers (2024-07-29T13:15:51Z) - Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph [4.075741925017479]
Group Activity Recognition aims to understand collective activities from videos.
Existing solutions rely on the RGB modality, which encounters challenges such as background variations.
We design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity.
arXiv Detail & Related papers (2024-07-28T13:57:03Z) - Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition [12.382193259575805]
We propose a novel multi-modality co-learning (MMCL) framework for efficient skeleton-based action recognition.
Our MMCL framework engages in multi-modality co-learning during the training stage and maintains efficiency by employing only concise skeletons during inference.
arXiv Detail & Related papers (2024-07-22T15:16:47Z) - An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z) - Egocentric RGB+Depth Action Recognition in Industry-Like Settings [50.38638300332429]
Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment.
Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively.
Our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
arXiv Detail & Related papers (2023-09-25T08:56:22Z) - A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion
Recognition [24.02488085447691]
We introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp, to provide additional temporal regularization for motion recognition.
Secondly, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning.
arXiv Detail & Related papers (2022-11-16T19:00:23Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are available only for the source dataset but unavailable for the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - Pose And Joint-Aware Action Recognition [87.4780883700755]
We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder.
Our joint selector module re-weights the joint information to select the most discriminative joints for the task.
We show large improvements over the current state-of-the-art joint-based approaches on the JHMDB, HMDB, Charades, and AVA action recognition datasets.
arXiv Detail & Related papers (2020-10-16T04:43:34Z) - Modality Compensation Network: Cross-Modal Adaptation for Action
Recognition [77.24983234113957]
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
arXiv Detail & Related papers (2020-01-31T04:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.