Exploiting the ConvLSTM: Human Action Recognition using Raw Depth
Video-Based Recurrent Neural Networks
- URL: http://arxiv.org/abs/2006.07744v1
- Date: Sat, 13 Jun 2020 23:35:59 GMT
- Title: Exploiting the ConvLSTM: Human Action Recognition using Raw Depth
Video-Based Recurrent Neural Networks
- Authors: Adrian Sanchez-Caballero, David Fuentes-Jimenez, Cristina
Losada-Guti\'errez
- Abstract summary: We propose and compare two neural networks based on the convolutional long short-term memory unit, namely ConvLSTM.
We show that the proposed models achieve competitive recognition accuracies with lower computational cost compared with state-of-the-art methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As in many other different fields, deep learning has become the main approach
in most computer vision applications, such as scene understanding, object
recognition, computer-human interaction or human action recognition (HAR).
Research efforts within HAR have mainly focused on how to efficiently extract
and process both spatial and temporal dependencies of video sequences. In this
paper, we propose and compare, two neural networks based on the convolutional
long short-term memory unit, namely ConvLSTM, with differences in the
architecture and the long-term learning strategy. The former uses a
video-length adaptive input data generator (\emph{stateless}) whereas the
latter explores the \emph{stateful} ability of general recurrent neural
networks but applied in the particular case of HAR. This stateful property
allows the model to accumulate discriminative patterns from previous frames
without compromising computer memory. Experimental results on the large-scale
NTU RGB+D dataset show that the proposed models achieve competitive recognition
accuracies with lower computational cost compared with state-of-the-art methods
and prove that, in the particular case of videos, the rarely-used stateful mode
of recurrent neural networks significantly improves the accuracy obtained with
the standard mode. The recognition accuracies obtained are 75.26\% (CS) and
75.45\% (CV) for the stateless model, with an average time consumption per
video of 0.21 s, and 80.43\% (CS) and 79.91\%(CV) with 0.89 s for the stateful
version.
Related papers
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with attention mechanism, we can effectively boost performance without huge computational overhead.
We show our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z) - Human activity recognition using deep learning approaches and single
frame cnn and convolutional lstm [0.0]
We explore two deep learning-based approaches, namely single frame Convolutional Neural Networks (CNNs) and convolutional Long Short-Term Memory to recognise human actions from videos.
The two models were trained and evaluated on a benchmark action recognition dataset, UCF50, and another dataset that was created for the experimentation.
Though both models exhibit good accuracies, the single frame CNN model outperforms the Convolutional LSTM model by having an accuracy of 99.8% with the UCF50 dataset.
arXiv Detail & Related papers (2023-04-18T01:33:29Z) - Continuous time recurrent neural networks: overview and application to
forecasting blood glucose in the intensive care unit [56.801856519460465]
Continuous time autoregressive recurrent neural networks (CTRNNs) are a deep learning model that account for irregular observations.
We demonstrate the application of these models to probabilistic forecasting of blood glucose in a critical care setting.
arXiv Detail & Related papers (2023-04-14T09:39:06Z) - Video Action Recognition Collaborative Learning with Dynamics via
PSO-ConvNet Transformer [1.876462046907555]
We propose a novel PSO-ConvNet model for learning actions in videos.
Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy.
Overall, our dynamic PSO-ConvNet model provides a promising direction for improving Human Action Recognition.
arXiv Detail & Related papers (2023-02-17T23:39:34Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action
Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z) - Time-Frequency Localization Using Deep Convolutional Maxout Neural
Network in Persian Speech Recognition [0.0]
Time-frequency flexibility in some mammals' auditory neurons system improves recognition performance.
This paper proposes a CNN-based structure for time-frequency localization of audio signal information in the ASR acoustic model.
The average recognition score of TFCMNN models is about 1.6% higher than the average of conventional models.
arXiv Detail & Related papers (2021-08-09T05:46:58Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - A Variational Information Bottleneck Based Method to Compress Sequential
Networks for Human Action Recognition [9.414818018857316]
We propose a method to effectively compress Recurrent Neural Networks (RNNs) used for Human Action Recognition (HAR)
We use a Variational Information Bottleneck (VIB) theory-based pruning approach to limit the information flow through the sequential cells of RNNs to a small subset.
We combine our pruning method with a specific group-lasso regularization technique that significantly improves compression.
It is shown that our method achieves over 70 times greater compression than the nearest competitor with comparable accuracy for the task of action recognition on UCF11.
arXiv Detail & Related papers (2020-10-03T12:41:51Z) - Binary Neural Networks for Memory-Efficient and Effective Visual Place
Recognition in Changing Environments [24.674034243725455]
Visual place recognition (VPR) is a robot's ability to determine whether a place was visited before using visual data.
CNN-based approaches are unsuitable for resource-constrained platforms, such as small robots and drones.
We propose a new class of highly compact models that drastically reduces the memory requirements and computational effort.
arXiv Detail & Related papers (2020-10-01T22:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.