3D Convolutional with Attention for Action Recognition
- URL: http://arxiv.org/abs/2206.02203v1
- Date: Sun, 5 Jun 2022 15:12:57 GMT
- Title: 3D Convolutional with Attention for Action Recognition
- Authors: Labina Shrestha, Shikha Dubey, Farrukh Olimov, Muhammad Aasim Rafique,
Moongu Jeon
- Abstract summary: Current action recognition methods use computationally expensive models for learning spatio-temporal dependencies of the action.
This paper proposes a deep neural network architecture for learning such dependencies consisting of a 3D convolutional layer, fully connected layers, and an attention layer.
The method first learns spatial and temporal features of actions through a 3D-CNN, and then the attention mechanism helps the model direct attention to essential features.
- Score: 6.238518976312625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human action recognition is one of the challenging tasks in computer vision.
The current action recognition methods use computationally expensive models for
learning spatio-temporal dependencies of the action. Models utilizing RGB
channels and optical flow separately, models using a two-stream fusion
technique, and models consisting of both convolutional neural network (CNN) and
long short-term memory (LSTM) network are a few examples of such complex models.
Moreover, fine-tuning such complex models is computationally expensive as well.
This paper proposes a deep neural network architecture for learning such
dependencies consisting of a 3D convolutional layer, fully connected (FC)
layers, and an attention layer, which is simpler to implement and gives
competitive performance on the UCF-101 dataset. The proposed method first
learns spatial and temporal features of actions through 3D-CNN, and then the
attention mechanism helps the model to locate attention to essential features
for recognition.
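The pipeline described in the abstract (3D convolution followed by attention over the learned features) can be illustrated with a minimal sketch. This is not the authors' implementation: the single-channel kernel, the dot-product scoring vector `score_weights`, and the helper names `conv3d_valid` and `attention_pool` are all assumptions made for illustration, and the fully connected layers and training are omitted. It uses only the Python standard library so the arithmetic is easy to follow.

```python
import math

def conv3d_valid(video, kernel):
    """Naive single-channel 3D cross-correlation with 'valid' padding and ReLU.

    video:  nested list indexed [T][H][W]
    kernel: nested list indexed [kt][kh][kw]
    (Hypothetical helper; a real 3D-CNN has many channels and learned kernels.)
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel spans time (dt) as well as space (dy, dx).
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(max(acc, 0.0))  # ReLU non-linearity
            plane.append(row)
        out.append(plane)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, score_weights):
    """Softmax attention over time steps (assumed scoring: dot product).

    features:      list of T feature vectors of equal length
    score_weights: vector of the same length used to score each time step
    Returns (attention weights, attention-weighted feature vector).
    """
    scores = [sum(f * w for f, w in zip(feat, score_weights)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [
        sum(weights[t] * features[t][d] for t in range(len(features)))
        for d in range(dim)
    ]
    return weights, pooled

# Toy clip: 4 frames of 3x3 pixels whose brightness equals the frame index.
video = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]  # 2x2x2 box filter

feature_maps = conv3d_valid(video, kernel)  # indexed [3][2][2]
# Flatten each temporal slice into one feature vector per time step.
features = [[v for row in plane for v in row] for plane in feature_maps]
weights, pooled = attention_pool(features, [0.1] * 4)
```

In this toy setup the later, brighter time steps receive larger attention weights, so the pooled vector is dominated by them; a classifier head would then consume `pooled`.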
Related papers
- Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles [0.0]
This work presents an innovative approach to validate video content.
The methodology blends advanced 2-dimensional and 3-dimensional Convolutional Neural Networks.
Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfake generation.
arXiv Detail & Related papers (2023-10-25T06:00:37Z) - Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z) - STSM: Spatio-Temporal Shift Module for Efficient Action Recognition [4.096670184726871]
We propose a plug-and-play Spatio-temporal Shift Module (STSM) that is both effective and high-performance.
In particular, when the network is a 2D CNN, our STSM module allows the network to learn efficient spatio-temporal features.
arXiv Detail & Related papers (2021-12-05T09:40:49Z) - Learning A 3D-CNN and Transformer Prior for Hyperspectral Image
Super-Resolution [80.93870349019332]
We propose a novel HSISR method that uses Transformer instead of CNN to learn the prior of HSIs.
Specifically, we first use the gradient algorithm to solve the HSISR model, and then use an unfolding network to simulate the iterative solution processes.
arXiv Detail & Related papers (2021-11-27T15:38:57Z) - Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes.
Our method combines the strength of both neural network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two or more spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive
Learning [109.84770951839289]
We present PredRNN, a new recurrent network for learning visual dynamics from historical context.
We show that our approach obtains highly competitive results on three standard datasets.
arXiv Detail & Related papers (2021-03-17T08:28:30Z) - Model-inspired Deep Learning for Light-Field Microscopy with Application
to Neuron Localization [27.247818386065894]
We propose a model-inspired deep learning approach to perform fast and robust 3D localization of sources using light-field microscopy images.
This is achieved by developing a deep network that efficiently solves a convolutional sparse coding problem.
Experiments on localization of mammalian neurons from light-fields show that the proposed approach simultaneously provides enhanced performance, interpretability and efficiency.
arXiv Detail & Related papers (2021-03-10T16:24:47Z) - Directional Temporal Modeling for Action Recognition [24.805397801876687]
We introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features.
Our CIDC network can be attached to any activity recognition backbone network.
arXiv Detail & Related papers (2020-07-21T18:49:57Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z) - Interpreting video features: a comparison of 3D convolutional networks
and convolutional LSTM networks [1.462434043267217]
We compare how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames.
Our findings indicate that the 3D convolutional model concentrates on shorter events in the input sequence, and places its spatial focus on fewer, contiguous areas.
arXiv Detail & Related papers (2020-02-02T11:27:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.