Infrared and 3D skeleton feature fusion for RGB-D action recognition
- URL: http://arxiv.org/abs/2002.12886v1
- Date: Fri, 28 Feb 2020 17:42:53 GMT
- Title: Infrared and 3D skeleton feature fusion for RGB-D action recognition
- Authors: Alban Main de Boissiere, Rita Noumeir
- Abstract summary: We propose a modular network combining skeleton and infrared data.
A 2D convolutional neural network (CNN) is used as a pose module to extract features from skeleton data.
A 3D CNN is used as an infrared module to extract visual cues from videos.
- Score: 0.30458514384586394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A challenge of skeleton-based action recognition is the difficulty of
classifying actions with similar motions and object-related actions. Visual cues
from other streams help in that regard. RGB data are sensitive to illumination
conditions, thus unusable in the dark. To alleviate this issue and still
benefit from a visual stream, we propose a modular network (FUSION) combining
skeleton and infrared data. A 2D convolutional neural network (CNN) is used as
a pose module to extract features from skeleton data. A 3D CNN is used as an
infrared module to extract visual cues from videos. Both feature vectors are
then concatenated and exploited jointly by a multilayer perceptron (MLP).
Skeleton data also condition the infrared videos, providing a crop around the
performing subjects and thus virtually focusing the attention of the infrared
module. Ablation studies show that using modules pre-trained on other
large-scale datasets, together with data augmentation, yields considerable
improvements in action classification accuracy. The strong contribution of
our cropping strategy is also demonstrated. We evaluate our method on the NTU
RGB+D dataset, the largest dataset for human action recognition from depth
cameras, and report state-of-the-art performance.
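As a concrete reading of the abstract, the minimal PyTorch sketch below shows the described two-stream layout: a 2D CNN over skeleton maps, a 3D CNN over skeleton-cropped infrared clips, and an MLP over the concatenated features. All layer sizes, the crop margin, and the helper names are illustrative assumptions, not the authors' exact FUSION architecture.

```python
# Minimal sketch of the FUSION idea, assuming illustrative layer sizes.
import torch
import torch.nn as nn

def skeleton_crop(frames, joints_xy, margin=16):
    """Crop IR frames (..., H, W) to a box around the projected joints,
    mimicking the skeleton-conditioned cropping; margin is an assumption."""
    x_min = max(int(joints_xy[..., 0].min()) - margin, 0)
    x_max = int(joints_xy[..., 0].max()) + margin
    y_min = max(int(joints_xy[..., 1].min()) - margin, 0)
    y_max = int(joints_xy[..., 1].max()) + margin
    return frames[..., y_min:y_max, x_min:x_max]

class FusionSketch(nn.Module):
    def __init__(self, num_classes=60, pose_dim=256, ir_dim=256):
        super().__init__()
        # Pose module: a skeleton sequence treated as a 2D map
        # (channels = xyz coordinates, H = joints, W = frames).
        self.pose_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, pose_dim),
        )
        # Infrared module: 3D convolutions over a cropped IR clip (1, T, H, W).
        self.ir_cnn = nn.Sequential(
            nn.Conv3d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, ir_dim),
        )
        # MLP head over the concatenated feature vectors.
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + ir_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, skeleton, ir_clip):
        feats = torch.cat([self.pose_cnn(skeleton), self.ir_cnn(ir_clip)], dim=1)
        return self.mlp(feats)
```

The `num_classes=60` default matches the NTU RGB+D label set; everything else is a placeholder for the paper's pre-trained modules.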
Related papers
- fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction [50.534007259536715]
We present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4768 3D objects.
We propose MinD-3D, a novel framework designed to decode 3D visual information from fMRI signals.
arXiv Detail & Related papers (2024-09-17T16:13:59Z)
- Salient Object Detection in RGB-D Videos [11.805682025734551]
This paper makes two primary contributions: the dataset and the model.
We construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth.
We introduce DCTNet+, a three-stream network tailored for RGB-D VSOD.
arXiv Detail & Related papers (2023-10-24T03:18:07Z)
- Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection [22.60228799622782]
A key bottleneck in object detection in IR images is the lack of sufficient labeled training data.
We seek to leverage cues from the RGB modality to scale object detectors to the IR modality, while preserving model performance in the RGB modality.
We first pretrain these factor matrices on the RGB modality, for which plenty of training data are assumed to exist, and then add only a few trainable parameters for training on the IR modality to avoid over-fitting.
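A hedged sketch of this recipe, assuming a diagonal-core low-rank factorization (the paper's exact tensor decomposition may differ): the shared factors are pretrained on RGB and frozen, and only a rank-sized vector is trained on scarce IR data.

```python
# Illustrative low-rank layer: W = U diag(S) V. U and V are pretrained on
# RGB and frozen; only S (rank-many scalars) is tuned on the IR modality.
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    def __init__(self, in_f, out_f, rank=8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_f, rank) * 0.1)  # RGB-pretrained
        self.V = nn.Parameter(torch.randn(rank, in_f) * 0.1)   # RGB-pretrained
        self.S = nn.Parameter(torch.ones(rank))  # IR-specific, tiny

    def forward(self, x):
        W = self.U @ torch.diag(self.S) @ self.V  # low-rank weight
        return x @ W.t()

layer = FactorizedLinear(128, 64)
# After RGB pretraining, freeze the shared factors:
layer.U.requires_grad_(False)
layer.V.requires_grad_(False)
# Only layer.S is updated when fine-tuning on IR, limiting over-fitting.
```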
arXiv Detail & Related papers (2023-09-28T16:55:52Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via neural rendering.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- ChiNet: Deep Recurrent Convolutional Learning for Multimodal Spacecraft Pose Estimation [3.964047152162558]
This paper presents an innovative deep learning pipeline which estimates the relative pose of a spacecraft by incorporating the temporal information from a rendezvous sequence.
It leverages the performance of long short-term memory (LSTM) units in modelling sequences of data for the processing of features extracted by a convolutional neural network (CNN) backbone.
Three distinct training strategies, which follow a coarse-to-fine funnelled approach, are combined to facilitate feature learning and improve end-to-end pose estimation by regression.
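The described pipeline lends itself to a compact sketch: per-frame CNN features feed an LSTM, whose last state is regressed to a pose. Dimensions and the translation-plus-quaternion output are assumptions for illustration, not ChiNet's actual design.

```python
# Sketch of a CNN backbone + LSTM pose regressor over a rendezvous sequence.
import torch
import torch.nn as nn

class ChiNetSketch(nn.Module):
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(          # per-frame feature extractor
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 7)        # 3 translation + 4 quaternion

    def forward(self, frames):                  # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        f = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(f)                   # temporal modelling
        return self.head(out[:, -1])            # pose from the last time step
```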
arXiv Detail & Related papers (2021-08-23T16:48:58Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters are then used to perform dynamic convolutions on their respective input feature maps.
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
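A simplified sketch of one such branch, assuming depthwise per-sample filters realized with the grouped-convolution trick; MFGNet's actual generator and fusion design are richer than this.

```python
# One modality branch: a small generator predicts per-sample depthwise
# filters from global context, which are then applied to the feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterBranch(nn.Module):
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.k, self.channels = k, channels
        self.gen = nn.Sequential(               # filter generator
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels * k * k),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        filters = self.gen(x).view(b * c, 1, self.k, self.k)
        # Depthwise dynamic convolution: one generated filter per channel
        # and per sample, via groups = B * C.
        y = F.conv2d(x.view(1, b * c, h, w), filters,
                     padding=self.k // 2, groups=b * c)
        return y.view(b, c, h, w)

visible_branch = DynamicFilterBranch()   # two independent generators,
thermal_branch = DynamicFilterBranch()   # as the summary describes
```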
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- Cloud based Scalable Object Recognition from Video Streams using Orientation Fusion and Convolutional Neural Networks [11.44782606621054]
Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition.
However, CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets.
We propose a new CNN method based on orientation fusion for visual object recognition.
arXiv Detail & Related papers (2021-06-19T07:15:15Z)
- MobileSal: Extremely Efficient RGB-D Salient Object Detection [62.04876251927581]
This paper introduces a novel network, MobileSal, which focuses on efficient RGB-D salient object detection (SOD).
We propose an implicit depth restoration (IDR) technique to strengthen the feature representation capability of mobile networks for RGB-D SOD.
With IDR and CPR incorporated, MobileSal performs favorably against state-of-the-art methods on seven challenging RGB-D SOD datasets.
arXiv Detail & Related papers (2020-12-24T04:36:42Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
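As an illustration of the pretext target, the sketch below computes a simplified statistic (the dominant-motion block on a coarse grid plus a centroid-displacement direction) that a network would then be trained to regress; the paper's actual summaries are more elaborate.

```python
# Toy pretext targets derived from frame differences; a 3D CNN would be
# trained (e.g., with MSE loss) to predict them from the raw frames.
import torch

def motion_statistics(clip, grid=4):
    """clip: (T, H, W) grayscale frames, H and W divisible by grid.
    Returns (dominant_block_index, dx, dy)."""
    diff = (clip[1:] - clip[:-1]).abs().sum(0)          # (H, W) motion energy
    h, w = diff.shape
    blocks = diff.reshape(grid, h // grid, grid, w // grid).sum((1, 3))
    idx = int(blocks.flatten().argmax())                # largest-motion block

    def centroid(f):                                    # intensity centroid
        ys = torch.arange(f.shape[0], dtype=torch.float32)
        xs = torch.arange(f.shape[1], dtype=torch.float32)
        m = f.sum().clamp(min=1e-6)
        return (f.sum(1) @ ys / m, f.sum(0) @ xs / m)

    # Dominant direction approximated by centroid displacement.
    y0, x0 = centroid(clip[0])
    y1, x1 = centroid(clip[-1])
    return torch.tensor([float(idx), float(x1 - x0), float(y1 - y0)])
```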
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Skeleton Focused Human Activity Recognition in RGB Video [11.521107108725188]
We propose a multimodal feature fusion model that utilizes both skeleton and RGB modalities to infer human activity.
The model can be trained either individually or uniformly by the back-propagation algorithm in an end-to-end manner.
arXiv Detail & Related papers (2020-04-29T06:40:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.