Directional Temporal Modeling for Action Recognition
- URL: http://arxiv.org/abs/2007.11040v1
- Date: Tue, 21 Jul 2020 18:49:57 GMT
- Title: Directional Temporal Modeling for Action Recognition
- Authors: Xinyu Li, Bing Shuai, Joseph Tighe
- Abstract summary: We introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features.
Our CIDC network can be attached to any activity recognition backbone network.
- Score: 24.805397801876687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many current activity recognition models use 3D convolutional neural networks
(e.g. I3D, I3D-NL) to generate local spatial-temporal features. However, such
features do not encode clip-level ordered temporal information. In this paper,
we introduce a channel independent directional convolution (CIDC) operation,
which learns to model the temporal evolution among local features. By applying
multiple CIDC units we construct a light-weight network that models the
clip-level temporal evolution across multiple spatial scales. Our CIDC network
can be attached to any activity recognition backbone network. We evaluate our
method on four popular activity recognition datasets and consistently improve
upon state-of-the-art techniques. We further visualize the activation map of
our CIDC network and show that it is able to focus on more meaningful, action
related parts of the frame.
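Working only from this abstract, a CIDC unit can be pictured as a temporal convolution that (a) operates on each channel independently and (b) aggregates information in a single temporal direction, so that the output encodes frame order. The PyTorch sketch below is a minimal, hypothetical rendering of that idea using a causal depthwise 1D convolution; the class name and interface are illustrative, and the paper's actual CIDC operation may differ in detail (e.g., in how directionality is imposed on the kernel).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalTemporalConv(nn.Module):
    """Illustrative stand-in for a channel-independent directional
    temporal convolution (hypothetical; not the authors' exact CIDC).

    groups=channels makes the convolution channel-independent, and
    left-only padding makes it directional: each output frame sees
    only the current and past frames, never future ones.
    """
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1  # pad on the past side only
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. spatially pooled backbone features
        x = F.pad(x, (self.pad, 0))  # causal, forward-in-time padding
        return self.conv(x)

if __name__ == "__main__":
    feats = torch.randn(2, 64, 16)            # 2 clips, 64 channels, 16 frames
    out = DirectionalTemporalConv(64)(feats)
    print(out.shape)                           # torch.Size([2, 64, 16])
```

A backward-in-time counterpart follows by padding on the future side instead; stacking both directions at several spatial scales would approximate the multi-scale, clip-level temporal modeling the abstract describes.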
Related papers
- SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality among 2D-3D network-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Blockwise Temporal-Spatial Pathway Network [0.2538209532048866]
We propose a 3D-CNN-based action recognition model called the blockwise temporal-spatial pathway network (BTSNet).
Its design is inspired by adaptive kernel selection-based models, which adaptively choose spatial receptive fields for image recognition.
For evaluation, we tested our proposed model on UCF-101, HMDB-51, SVW, and EpicKitchen datasets.
arXiv Detail & Related papers (2022-08-05T08:43:30Z)
- 3D Convolutional with Attention for Action Recognition [6.238518976312625]
Current action recognition methods use computationally expensive models to learn the spatio-temporal dependencies of an action.
This paper proposes a deep neural network architecture for learning such dependencies, consisting of a 3D convolutional layer, fully connected layers, and an attention layer.
The method first learns the spatial and temporal features of actions through a 3D-CNN, and a temporal attention mechanism then helps the model focus on the essential features.
arXiv Detail & Related papers (2022-06-05T15:12:57Z)
- A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Anchor-Based Spatial-Temporal Attention Convolutional Networks for Dynamic 3D Point Cloud Sequences [20.697745449159097]
An Anchor-based Spatial-Temporal Attention Convolution operation (ASTAConv) is proposed in this paper to process dynamic 3D point cloud sequences.
The proposed convolution operation builds a regular receptive field around each point by setting several virtual anchors around it.
The proposed method makes better use of the structured information within the local region and learns spatial-temporal embedding features from dynamic 3D point cloud sequences.
arXiv Detail & Related papers (2020-12-20T07:35:37Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting temporal features at multiple resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections (a sketch of the central-difference idea appears after this list).
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Interpreting video features: a comparison of 3D convolutional networks and convolutional LSTM networks [1.462434043267217]
We compare how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames.
Our findings indicate that the 3D convolutional model concentrates on shorter events in the input sequence, and places its spatial focus on fewer, contiguous areas.
arXiv Detail & Related papers (2020-02-02T11:27:07Z)
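The 3D-CDC family named in the gesture-recognition entry above builds on central difference convolution. The sketch below is a hedged illustration that assumes 3D-CDC follows the published 2D central-difference formulation, y = conv(x) - theta * (sum of kernel weights) * x, lifted to three dimensions; the class name and the default theta are assumptions for illustration, not taken from the NAS paper itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv3d(nn.Module):
    """Sketch of a 3D central difference convolution (CDC). Assumption:
    this mirrors the 2D-CDC formulation extended to 3D and is not the
    gesture-NAS paper's own code."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3,
                 theta: float = 0.7):
        super().__init__()
        self.theta = theta
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)  # vanilla 3D convolution
        if self.theta > 0:
            # central-difference term: subtract each position's own value,
            # weighted by the kernel's total mass (a 1x1x1 conv with the
            # summed weights keeps channels and spatial dims aligned)
            kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
            out = out - self.theta * F.conv3d(x, kernel_sum)
        return out

if __name__ == "__main__":
    clip = torch.randn(1, 3, 8, 32, 32)   # batch, channels, T, H, W
    print(CentralDifferenceConv3d(3, 16)(clip).shape)  # (1, 16, 8, 32, 32)
```

With theta = 0 the operation reduces to a vanilla 3D convolution, so theta trades off raw intensity aggregation against gradient-like fine-detail cues.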
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.