ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human
Activity Recognition in Videos
- URL: http://arxiv.org/abs/2208.07929v1
- Date: Tue, 16 Aug 2022 20:03:53 GMT
- Title: ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human
Activity Recognition in Videos
- Authors: James Wensel, Hayat Ullah, Arslan Munir, Erik Blasch
- Abstract summary: This paper proposes and designs two transformer neural networks for human activity recognition.
A recurrent transformer (ReT) is a specialized neural network used to make predictions on sequences of data, and a vision transformer (ViT) is a transformer optimized for extracting salient features from images.
We provide an extensive comparison of the proposed transformer neural networks with contemporary CNN- and RNN-based human activity recognition models in terms of speed and accuracy.
- Score: 6.117917355232902
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human activity recognition is an emerging and important area in computer
vision which seeks to determine the activity an individual or group of
individuals is performing. The applications of this field range from
generating highlight videos in sports to intelligent surveillance and gesture
recognition. Most activity recognition systems rely on a combination of
convolutional neural networks (CNNs) to perform feature extraction from the
data and recurrent neural networks (RNNs) to determine the time dependent
nature of the data. This paper proposes and designs two transformer neural
networks for human activity recognition: a recurrent transformer (ReT), a
specialized neural network used to make predictions on sequences of data, as
well as a vision transformer (ViT), a transformer optimized for extracting
salient features from images, to improve speed and scalability of activity
recognition. We have provided an extensive comparison of the proposed
transformer neural networks with contemporary CNN- and RNN-based human
activity recognition models in terms of speed and accuracy.
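As a concrete illustration of the two-stage design the abstract describes (a vision transformer for per-frame features, a recurrent/temporal transformer over the frame sequence), here is a minimal PyTorch sketch. The class name, layer sizes, and pooling choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ViTReTSketch(nn.Module):
    """Illustrative two-stage model: a vision transformer extracts
    per-frame features, and a temporal transformer models the sequence
    of frame features to predict an activity class."""

    def __init__(self, num_classes: int, dim: int = 256,
                 patch: int = 16):
        super().__init__()
        # Patch embedding: split each frame into 16x16 patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, H, W)
        b, t, c, h, w = clip.shape
        x = self.patch_embed(clip.reshape(b * t, c, h, w))  # (b*t, dim, h/p, w/p)
        x = x.flatten(2).transpose(1, 2)                    # (b*t, patches, dim)
        x = self.spatial(x).mean(dim=1)                     # one feature per frame
        x = self.temporal(x.reshape(b, t, -1)).mean(dim=1)  # pool over time
        return self.head(x)                                 # (batch, num_classes)

model = ViTReTSketch(num_classes=101)                       # e.g. UCF101
logits = model(torch.randn(2, 8, 3, 224, 224))              # 2 clips, 8 frames
print(logits.shape)                                         # torch.Size([2, 101])
```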
Related papers
- Graph Neural Networks for Learning Equivariant Representations of Neural Networks [55.04145324152541]
We propose to represent neural networks as computational graphs of parameters.
Our approach enables a single model to encode neural computational graphs with diverse architectures.
We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations.
arXiv Detail & Related papers (2024-03-18T18:01:01Z)
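The entry above hinges on viewing a network's parameters as a computational graph. Below is a hedged sketch of one plausible encoding, with nodes for neurons and directed edges carrying the weights; the paper's actual graph construction and features may differ.

```python
import numpy as np

def mlp_to_graph(weights: list, biases: list):
    """Encode an MLP as a graph: one node per neuron (node feature =
    its bias, zero for inputs), one directed edge per weight.
    A plausible encoding in the spirit of the paper, not its exact scheme."""
    layer_sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    offsets = np.cumsum([0] + layer_sizes)       # first node id of each layer
    node_feats = np.concatenate([np.zeros(layer_sizes[0])] + list(biases))
    edges, edge_feats = [], []
    for l, w in enumerate(weights):              # w has shape (out, in)
        for j in range(w.shape[0]):
            for i in range(w.shape[1]):
                edges.append((offsets[l] + i, offsets[l + 1] + j))
                edge_feats.append(w[j, i])
    return node_feats, np.array(edges), np.array(edge_feats)

# A tiny 2-3-1 MLP becomes a graph with 6 nodes and 9 edges.
w = [np.random.randn(3, 2), np.random.randn(1, 3)]
b = [np.random.randn(3), np.random.randn(1)]
nodes, edges, ew = mlp_to_graph(w, b)
print(nodes.shape, edges.shape, ew.shape)        # (6,) (9, 2) (9,)
```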
- Design and development of opto-neural processors for simulation of neural networks trained in image detection for potential implementation in hybrid robotics [0.0]
Living neural networks offer advantages of lower power consumption, faster processing, and biological realism.
This work proposes a simulated living neural network trained indirectly with backpropagated, STDP-based algorithms that use precise optogenetic activation.
arXiv Detail & Related papers (2024-01-17T04:42:49Z)
- ConViViT -- A Deep Neural Network Combining Convolutions and Factorized Self-Attention for Human Activity Recognition [3.6321891270689055]
We propose a novel approach that leverages the strengths of both CNNs and Transformers in a hybrid architecture for performing activity recognition using RGB videos.
Our architecture achieves new state-of-the-art results of 90.05%, 99.6%, and 95.09% on HMDB51, UCF101, and ETRI-Activity3D, respectively.
arXiv Detail & Related papers (2023-10-22T21:13:43Z)
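To make the hybrid CNN-plus-transformer idea in the ConViViT entry concrete, the following sketch embeds each frame with a small CNN and applies self-attention along the time axis only, which is one way to "factorize" attention. All layer sizes and the exact factorization are assumptions, not the ConViViT architecture itself.

```python
import torch
import torch.nn as nn

class HybridCNNTransformer(nn.Module):
    """Sketch of a CNN + factorized-attention hybrid: a small CNN embeds
    each frame, then self-attention runs over the time axis only.
    Layer sizes are illustrative guesses."""

    def __init__(self, num_classes: int, dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                  # per-frame spatial features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)                          # attention over frames only
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        return self.head(self.temporal(feats).mean(dim=1))

logits = HybridCNNTransformer(num_classes=51)(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)                                # torch.Size([2, 51])
```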
- Training Robust Spiking Neural Networks with ViewPoint Transform and SpatioTemporal Stretching [4.736525128377909]
We propose a novel data augmentation method, ViewPoint Transform and SpatioTemporal Stretching (VPT-STS).
It improves the robustness of spiking neural networks by transforming the rotation centers and angles in the spatiotemporal domain to generate samples from different viewpoints.
Experiments on prevailing neuromorphic datasets demonstrate that VPT-STS is broadly effective on multi-event representations and significantly outperforms pure spatial geometric transformations.
arXiv Detail & Related papers (2023-03-14T03:09:56Z)
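The viewpoint transform in VPT-STS amounts to rotating event coordinates about chosen centers. Below is a simplified NumPy reading of that step; the authors' full augmentation (including the stretching component) involves more than this.

```python
import numpy as np

def viewpoint_transform(events: np.ndarray, center: tuple,
                        angle_deg: float) -> np.ndarray:
    """Rotate event (x, y) coordinates about a center point, keeping
    timestamps and polarities unchanged; a simplified reading of the
    viewpoint transform, not the authors' exact augmentation."""
    x, y, t, p = events.T
    theta = np.deg2rad(angle_deg)
    cx, cy = center
    xr = cx + (x - cx) * np.cos(theta) - (y - cy) * np.sin(theta)
    yr = cy + (x - cx) * np.sin(theta) + (y - cy) * np.cos(theta)
    return np.stack([xr, yr, t, p], axis=1)

# Events as rows of (x, y, timestamp, polarity).
ev = np.array([[10.0, 20.0, 0.001, 1], [64.0, 64.0, 0.002, -1]])
print(viewpoint_transform(ev, center=(64, 64), angle_deg=15))
```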
- Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer [1.876462046907555]
We propose a novel PSO-ConvNet model for learning actions in videos.
Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy.
Overall, our dynamic PSO-ConvNet model provides a promising direction for improving human action recognition.
arXiv Detail & Related papers (2023-02-17T23:39:34Z)
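For readers unfamiliar with the PSO component, a textbook particle swarm update over flat weight vectors is sketched below; the paper's dynamic, collaborative variant adds its own information-sharing scheme on top of this basic mechanism, and the toy fitness here is purely illustrative.

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One textbook PSO update for a swarm of weight vectors:
    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x <- x + v.
    Standard PSO, shown only to illustrate the mechanism."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel

# Swarm of 8 particles, each a 1000-parameter "network".
rng = np.random.default_rng(0)
pos = rng.normal(size=(8, 1000))
vel = np.zeros_like(pos)
loss = lambda x: (x ** 2).sum(axis=-1)        # stand-in for validation loss
pbest, gbest = pos.copy(), pos[loss(pos).argmin()]
for _ in range(50):
    pos, vel = pso_step(pos, vel, pbest, gbest, rng=rng)
    better = loss(pos) < loss(pbest)
    pbest[better] = pos[better]
    gbest = pbest[loss(pbest).argmin()]
print(loss(gbest[None]))                      # best loss shrinks over iterations
```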
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Event-based Video Reconstruction via Potential-assisted Spiking Neural Network [48.88510552931186]
Bio-inspired neural networks can potentially lead to greater computational efficiency on event-driven hardware.
We propose a novel event-based video reconstruction framework based on a fully spiking neural network (EVSNN).
We find that the spiking neurons have the potential to store useful temporal information (memory) to complete such time-dependent tasks.
arXiv Detail & Related papers (2022-01-25T02:05:20Z)
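The "potential" that stores temporal information in the EVSNN entry is the neuron's membrane potential. The generic leaky integrate-and-fire update below makes that memory mechanism explicit; it is a standard LIF model, not the EVSNN architecture itself.

```python
import numpy as np

def lif_step(v, inp, decay=0.9, threshold=1.0):
    """One leaky integrate-and-fire update: the membrane potential v
    leaks and accumulates input over time, which is the temporal
    'memory'; a spike resets the potential to zero."""
    v = decay * v + inp                     # leak, then integrate new input
    spikes = (v >= threshold).astype(float)
    v = v * (1.0 - spikes)                  # hard reset where a spike fired
    return v, spikes

v = np.zeros(4)                             # four neurons
for step, drive in enumerate(np.random.default_rng(1).random((10, 4))):
    v, s = lif_step(v, 0.4 * drive)
    print(step, s)                          # spike trains emerge over time
```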
- Hybrid SNN-ANN: Energy-Efficient Classification and Object Detection for Event-Based Vision [64.71260357476602]
Event-based vision sensors encode local pixel-wise brightness changes in streams of events rather than image frames.
Recent progress in object recognition from event-based sensors has come from conversions of deep neural networks.
We propose a hybrid architecture for end-to-end training of deep neural networks for event-based pattern recognition and object detection.
arXiv Detail & Related papers (2021-12-06T23:45:58Z)
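Event sensors emit sparse (x, y, t, polarity) tuples rather than frames, so pipelines commonly accumulate events into a dense tensor before a network sees them. The voxel-grid encoding below is one standard choice, assumed here for illustration rather than taken from the paper.

```python
import numpy as np

def events_to_voxel_grid(events, bins, height, width):
    """Accumulate a stream of (x, y, t, polarity) events into a
    (bins, H, W) tensor, a common way to feed event data to a network;
    one standard encoding, not necessarily the paper's hybrid input."""
    x, y, t, p = events.T
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    b = np.clip((t_norm * bins).astype(int), 0, bins - 1)  # temporal bin index
    grid = np.zeros((bins, height, width))
    np.add.at(grid, (b, y.astype(int), x.astype(int)), p)  # signed accumulation
    return grid

ev = np.array([[3, 5, 0.00, 1], [3, 5, 0.01, -1], [7, 2, 0.02, 1]], float)
print(events_to_voxel_grid(ev, bins=2, height=8, width=8).shape)  # (2, 8, 8)
```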
- A Study On the Effects of Pre-processing On Spatio-temporal Action Recognition Using Spiking Neural Networks Trained with STDP [0.0]
It is important to study the behavior of SNNs trained with unsupervised learning methods on video classification tasks.
This paper presents methods of transposing temporal information into a static format, and then transforming the visual information into spikes using latency coding.
We show the effect of the similarity in the shape and speed of certain actions on action recognition with spiking neural networks.
arXiv Detail & Related papers (2021-05-31T07:07:48Z)
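Latency coding, mentioned in the entry above, maps pixel intensity to spike timing: brighter pixels fire earlier. A minimal sketch, with the linear mapping and time scale assumed:

```python
import numpy as np

def latency_encode(image, t_max=100.0):
    """Latency coding: brighter pixels fire earlier. Spike time is a
    linear function of inverted, normalized intensity; a common scheme
    consistent with the paper's description, with details assumed."""
    norm = (image - image.min()) / max(image.max() - image.min(), 1e-9)
    return t_max * (1.0 - norm)        # intensity 1.0 -> time 0 (earliest)

img = np.array([[0, 128], [200, 255]], dtype=float)
print(latency_encode(img))             # the brightest pixel spikes at t = 0
```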
- Neuroevolution of a Recurrent Neural Network for Spatial and Working Memory in a Simulated Robotic Environment [57.91534223695695]
We evolved weights in a biologically plausible recurrent neural network (RNN) using an evolutionary algorithm to replicate the behavior and neural activity observed in rats.
Our method demonstrates how the dynamic activity in evolved RNNs can capture interesting and complex cognitive behavior.
arXiv Detail & Related papers (2021-02-25T02:13:52Z)
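Evolving RNN weights, as in the last entry, replaces gradient descent with mutation and selection over flat weight vectors. The generic evolutionary loop below illustrates that mechanism; the paper's fitness function (replicating rat behavior and neural activity) and algorithmic details differ, and the toy fitness here is purely illustrative.

```python
import numpy as np

def evolve(fitness, dim, pop=32, elite=8, sigma=0.1, generations=100, seed=0):
    """Minimal elite-selection evolutionary loop over a flat weight
    vector, as one might use to evolve an RNN's weights; a generic EA,
    not the paper's specific neuroevolution setup."""
    rng = np.random.default_rng(seed)
    parents = rng.normal(size=(elite, dim))
    for _ in range(generations):
        # Mutate randomly chosen parents with Gaussian noise.
        idx = rng.integers(elite, size=pop)
        children = parents[idx] + sigma * rng.normal(size=(pop, dim))
        scores = np.array([fitness(c) for c in children])
        parents = children[np.argsort(scores)[-elite:]]   # keep the fittest
    return parents[-1]

# Toy stand-in fitness: how closely the weights match a target vector.
target = np.linspace(-1, 1, 20)
best = evolve(lambda w: -np.abs(w - target).sum(), dim=20)
print(np.round(best - target, 1))      # residuals shrink toward zero
```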