Multi-Temporal Convolutions for Human Action Recognition in Videos
- URL: http://arxiv.org/abs/2011.03949v2
- Date: Wed, 31 Mar 2021 15:02:49 GMT
- Title: Multi-Temporal Convolutions for Human Action Recognition in Videos
- Authors: Alexandros Stergiou and Ronald Poppe
- Abstract summary: We present a novel spatio-temporal convolution block that is capable of extracting spatio-temporal patterns at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
- Score: 83.43682368129072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective extraction of temporal patterns is crucial for the recognition of
temporally varying actions in video. We argue that the fixed-sized
spatio-temporal convolution kernels used in convolutional neural networks
(CNNs) can be improved to extract informative motions that are executed at
different time scales. To address this challenge, we present a novel
spatio-temporal convolution block that is capable of extracting spatio-temporal
patterns at multiple temporal resolutions. Our proposed multi-temporal
convolution (MTConv) blocks utilize two branches that focus on brief and
prolonged spatio-temporal patterns, respectively. The extracted time-varying
features are aligned in a third branch, with respect to global motion patterns
through recurrent cells. The proposed blocks are lightweight and can be
integrated into any 3D-CNN architecture. This introduces a substantial
reduction in computational costs. Extensive experiments on Kinetics, Moments in
Time and HACS action recognition benchmark datasets demonstrate competitive
performance of MTConvs compared to the state-of-the-art with a significantly
lower computational footprint.
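To make the described block concrete, here is a minimal PyTorch sketch based only on the abstract: two 3D-convolution branches with short and long temporal kernel extents capture brief and prolonged spatio-temporal patterns, and a recurrent branch aligns the fused features with global motion context. The kernel sizes, the GRU cell, the sigmoid gating, and fusion by summation are illustrative assumptions, not the authors' exact MTConv design.

```python
# Minimal sketch of a multi-temporal convolution block (assumptions noted above).
import torch
import torch.nn as nn


class MTConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Branch 1: brief spatio-temporal patterns (short temporal kernel).
        self.short = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        # Branch 2: prolonged spatio-temporal patterns (long temporal kernel).
        self.long = nn.Conv3d(in_ch, out_ch, kernel_size=(7, 3, 3), padding=(3, 1, 1))
        # Branch 3: align time-varying features with global motion via a recurrent cell.
        self.rnn = nn.GRU(input_size=out_ch, hidden_size=out_ch, batch_first=True)

    def forward(self, x):                               # x: (N, C, T, H, W)
        fused = self.short(x) + self.long(x)            # assumed fusion by summation
        g = fused.mean(dim=(3, 4)).transpose(1, 2)      # per-frame descriptors (N, T, C_out)
        g, _ = self.rnn(g)                              # global motion context over time
        gate = torch.sigmoid(g).transpose(1, 2).unsqueeze(-1).unsqueeze(-1)
        return fused * gate                             # temporally modulated features


x = torch.randn(2, 16, 8, 32, 32)                       # (batch, channels, frames, H, W)
print(MTConvSketch(16, 32)(x).shape)                    # torch.Size([2, 32, 8, 32, 32])
```

Because both convolution branches keep the temporal length unchanged and the recurrent branch only produces a per-frame gate, such a block can act as a drop-in replacement for a standard 3D convolution with the same input and output channel sizes.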
Related papers
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted attention from the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multi-temporal representation.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Group-based Bi-Directional Recurrent Wavelet Neural Networks for Video Super-Resolution [4.9136996406481135]
Video super-resolution (VSR) aims to estimate a high-resolution (HR) frame from low-resolution (LR) frames.
The key challenge for VSR lies in effectively exploiting intra-frame spatial correlation and temporal dependency between consecutive frames.
arXiv Detail & Related papers (2021-06-14T06:36:13Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method outperforms the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections (a rough sketch of the 3D-CDC idea is shown after this list).
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while maintaining a high processing speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2x faster inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
- Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z)
- Temporal Interlacing Network [8.876132549551738]
The Temporal Interlacing Network (TIN) is a simple yet powerful operator for learning temporal features.
TIN fuses spatial and temporal information by interlacing spatial representations from the past to the future.
TIN won 1st place in the ICCV19 Multi-Moments in Time challenge.
arXiv Detail & Related papers (2020-01-17T19:06:05Z)
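The 3D Central Difference Convolution (3D-CDC) mentioned in the gesture recognition entry above augments a vanilla 3D convolution with a central-difference term. The sketch below illustrates the general idea only: the 3x3x3 kernel size, the fixed theta value, and the fully spatio-temporal difference variant are assumptions, and the NAS-searched backbones of that paper are not reproduced.

```python
# Sketch of a central-difference 3D convolution: vanilla response minus
# theta * (sum of kernel weights) applied to the centre voxel.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CDC3dSketch(nn.Module):
    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)                                          # ordinary 3D convolution
        # Central-difference term, implemented as a 1x1x1 convolution whose weights
        # are the spatio-temporally summed kernel weights.
        w_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)   # (out, in, 1, 1, 1)
        return out - self.theta * F.conv3d(x, w_sum)


x = torch.randn(1, 8, 16, 32, 32)
print(CDC3dSketch(8, 16)(x).shape)                                  # torch.Size([1, 16, 16, 32, 32])
```

Setting theta to 0 recovers an ordinary 3D convolution; larger values weight the gradient-like central-difference response more heavily.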