Three-stream network for enriched Action Recognition
- URL: http://arxiv.org/abs/2104.13051v1
- Date: Tue, 27 Apr 2021 08:56:11 GMT
- Title: Three-stream network for enriched Action Recognition
- Authors: Ivaxi Sheth
- Abstract summary: This paper proposes two CNN-based architectures with three streams, which allow the model to exploit the dataset under different settings.
By experimenting with various algorithms on the UCF-101, Kinetics-600 and AVA datasets, we observe that the proposed models achieve state-of-the-art performance on the human action recognition task.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding accurate information on human behaviours is one of the most
important tasks in machine intelligence. Human Activity Recognition that aims
to understand human activities from a video is a challenging task due to
various problems including background, camera motion and dataset variations.
This paper proposes two CNN-based architectures with three streams, which allow
the model to exploit the dataset under different settings. The three pathways
are differentiated by frame rate. The single pathway operates at a single frame
rate and captures spatial information; the slow pathway operates at a low frame
rate and captures spatial information; and the fast pathway operates at a high
frame rate and captures fine temporal information. After the CNN encoders, we
add a bidirectional LSTM and attention heads to capture context and temporal
features. By experimenting with various algorithms on the UCF-101, Kinetics-600
and AVA datasets, we observe that the proposed models achieve state-of-the-art
performance on the human action recognition task.
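The three-pathway idea above amounts to subsampling the same clip at different frame rates before each CNN encoder. The sketch below illustrates that sampling step only; the function name, strides, and the choice of the central frame are assumptions for illustration, not details from the paper:

```python
import numpy as np

def sample_pathways(clip, slow_stride=8, fast_stride=2):
    """Split a video clip (T, H, W, C) into the three pathways described
    in the abstract: one central frame (spatial), a low-frame-rate slow
    stream (spatial), and a high-frame-rate fast stream (fine temporal).
    Strides are illustrative choices, not the paper's settings."""
    t = clip.shape[0]
    single = clip[t // 2 : t // 2 + 1]   # single pathway: one frame
    slow = clip[::slow_stride]           # slow pathway: low frame rate
    fast = clip[::fast_stride]           # fast pathway: high frame rate
    return single, slow, fast

# Toy clip: 32 frames of 4x4 RGB
clip = np.zeros((32, 4, 4, 3))
single, slow, fast = sample_pathways(clip)
print(single.shape[0], slow.shape[0], fast.shape[0])  # 1 4 16
```

Each stream would then feed its own CNN encoder, with the bidirectional LSTM and attention heads applied to the encoded sequences.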
Related papers
- Action Recognition with Multi-stream Motion Modeling and Mutual
Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention.
Our approach sets the new state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2023-06-13T06:56:09Z) - Human activity recognition using deep learning approaches and single
frame cnn and convolutional lstm [0.0]
We explore two deep learning-based approaches, namely single frame Convolutional Neural Networks (CNNs) and convolutional Long Short-Term Memory to recognise human actions from videos.
The two models were trained and evaluated on a benchmark action recognition dataset, UCF50, and another dataset that was created for the experimentation.
Though both models exhibit good accuracy, the single frame CNN model outperforms the Convolutional LSTM model, achieving an accuracy of 99.8% on the UCF50 dataset.
arXiv Detail & Related papers (2023-04-18T01:33:29Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action
Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - The influence of labeling techniques in classifying human manipulation
movement of different speed [2.9972063833424216]
We investigate the influence of labeling methods on the classification of human movements on data recorded using a marker-based motion capture system.
The dataset is labeled using two different approaches, one based on video data of the movements, the other based on the movement trajectories recorded using the motion capture system.
arXiv Detail & Related papers (2022-02-04T23:04:22Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework, CTL, to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
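The channel-gating idea behind CME can be sketched as building a per-channel gate from the difference of adjacent feature maps, then re-weighting channels accordingly. Everything in this sketch (the shapes, the sigmoid gate, the function name) is an illustrative assumption, not the authors' exact module:

```python
import numpy as np

def channel_motion_gate(feat_t, feat_t1):
    """Sketch of channel-wise motion enhancement: derive a per-channel
    gate vector from the difference of adjacent feature maps (C, H, W)
    and re-weight the current features so channels carrying dynamic
    information are emphasized."""
    diff = feat_t1 - feat_t                  # inter-frame motion cue
    pooled = diff.mean(axis=(1, 2))          # global average pool -> (C,)
    gate = 1.0 / (1.0 + np.exp(-pooled))     # sigmoid gate per channel
    return feat_t * gate[:, None, None]      # channel-wise re-weighting

rng = np.random.default_rng(0)
f0, f1 = rng.normal(size=(2, 8, 4, 4))      # two adjacent feature maps
out = channel_motion_gate(f0, f1)
print(out.shape)  # (8, 4, 4)
```

The SME module described above would instead operate spatially, weighting regions by the point-to-point similarity between adjacent feature maps.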
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.