STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
- URL: http://arxiv.org/abs/2112.02523v1
- Date: Sun, 5 Dec 2021 09:40:49 GMT
- Title: STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
- Authors: Zhaoqilin Yang, Gaoyun An
- Abstract summary: We propose a plug-and-play Spatio-Temporal Shift Module (STSM) that is both effective and high-performance.
In particular, when the network is a 2D CNN, our STSM allows it to learn efficient spatio-temporal features.
- Score: 4.096670184726871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling capability, computational cost, and accuracy of traditional spatio-temporal
networks are the three most intensively studied topics in video action
recognition. Traditional 2D convolution has a low computational cost, but
it cannot capture temporal relationships; convolutional neural network
(CNN) models based on 3D convolution can obtain good performance, but their
computational cost is high and their number of parameters is large. In this
paper, we propose a plug-and-play Spatio-Temporal Shift Module (STSM), a
generic module that is both effective and high-performance. Specifically,
after STSM is inserted into another network, the performance of that network
can be improved without increasing the number of calculations or parameters. In
particular, when the network is a 2D CNN, our STSM allows the network to
learn efficient spatio-temporal features. We conducted extensive experiments
to evaluate the proposed module and study its effectiveness in video action
recognition, achieving state-of-the-art results on the
Kinetics-400 and Something-Something V2 datasets.
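The abstract does not spell out the exact shift pattern, so the following is only a minimal, illustrative sketch of what a zero-parameter spatio-temporal shift could look like in PyTorch: groups of channels are displaced along the temporal axis (as in TSM, listed among the related papers below) and along the two spatial axes, so that the plain 2D convolutions that follow can mix information across frames and positions. The function name, group sizes, and shift directions are assumptions for illustration, not the paper's specification.

```python
import torch

def spatio_temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Zero-parameter spatio-temporal shift over 2D-CNN features (sketch).

    x has shape (N, T, C, H, W): batch, frames, channels, height, width.
    One channel group is shifted forward in time, one backward (the TSM
    pattern), one down along height, one right along width; the rest are
    passed through unchanged. Vacated positions stay zero (zero padding).
    """
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
    out[:, :, 2 * fold:3 * fold, 1:, :] = x[:, :, 2 * fold:3 * fold, :-1, :]  # shift down
    out[:, :, 3 * fold:4 * fold, :, 1:] = x[:, :, 3 * fold:4 * fold, :, :-1]  # shift right
    out[:, :, 4 * fold:] = x[:, :, 4 * fold:]             # untouched channels
    return out

# usage: shift features between two 2D convolutions of a backbone block
feats = torch.randn(2, 8, 64, 56, 56)  # 2 clips, 8 frames, 64 channels
shifted = spatio_temporal_shift(feats)
assert shifted.shape == feats.shape    # pure indexing: no new parameters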
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer framework.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form (a minimal CWT sketch appears after this list).
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators [0.20971479389679332]
Energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors.
We show that enabling rich mixed quantization schemes during the implementation can open a previously hidden space of mappings.
CNNs utilizing quantized weights and activations and suitable mappings can significantly improve trade-offs among the accuracy, energy, and memory requirements.
arXiv Detail & Related papers (2024-04-08T10:10:30Z) - Neural Attentive Circuits [93.95502541529115]
We introduce a general-purpose, yet modular neural architecture called Neural Attentive Circuits (NACs).
NACs learn the parameterization and a sparse connectivity of neural modules without using domain knowledge.
NACs achieve an 8x speedup at inference time while losing less than 3% performance.
arXiv Detail & Related papers (2022-10-14T18:00:07Z) - 3D Convolutional with Attention for Action Recognition [6.238518976312625]
Current action recognition methods use computationally expensive models for learning spatio-temporal dependencies of the action.
This paper proposes a deep neural network architecture for learning such dependencies consisting of a 3D convolutional layer, fully connected layers and attention layer.
The method first learns spatial and temporal features of actions through a 3D-CNN, and then a temporal attention mechanism helps the model focus on essential features.
arXiv Detail & Related papers (2022-06-05T15:12:57Z) - TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device [58.776352999540435]
We propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance.
TSM is inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters.
It achieves high frame rates for online video recognition: 74 fps on Jetson Nano and 29 fps on Galaxy Note8.
arXiv Detail & Related papers (2021-09-27T17:59:39Z) - CT-Net: Channel Tensorization Network for Video Classification [48.4482794950675]
3D convolution is powerful for video classification but often computationally expensive.
Most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency.
We propose a concise and novel Channel Tensorization Network (CT-Net).
Our CT-Net outperforms a number of recent SOTA approaches, in terms of accuracy and/or efficiency.
arXiv Detail & Related papers (2021-06-03T05:35:43Z) - Physics Validation of Novel Convolutional 2D Architectures for Speeding Up High Energy Physics Simulations [0.0]
We apply Generative Adversarial Networks (GANs), a deep learning technique, to replace calorimeter detector simulations.
We develop new two-dimensional convolutional networks to solve the same 3D image generation problem faster.
Our results demonstrate a high physics accuracy and further consolidate the use of GANs for fast detector simulations.
arXiv Detail & Related papers (2021-05-19T07:24:23Z) - ACTION-Net: Multipath Excitation for Action Recognition [22.12530692711095]
We equip 2D CNNs with the proposed multipath excitation module to form a simple yet effective ACTION-Net with very limited extra computational cost.
ACTION-Net consistently outperforms its 2D CNN counterparts on three backbones.
arXiv Detail & Related papers (2021-03-11T16:23:40Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)
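As referenced in the TCCT-Net entry above, its "TC" stream lifts 1D behavioral signals into 2D tensors via a Continuous Wavelet Transform. Below is a minimal sketch of that general idea using PyWavelets; the wavelet choice ("morl"), the scale range, and the function name are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
import pywt  # PyWavelets

def cwt_scalogram(signal: np.ndarray, num_scales: int = 64,
                  wavelet: str = "morl", sampling_period: float = 1.0) -> np.ndarray:
    """Lift a 1D signal into a 2D (scales x time) magnitude map."""
    scales = np.arange(1, num_scales + 1)
    coeffs, _freqs = pywt.cwt(signal, scales, wavelet, sampling_period)
    return np.abs(coeffs)  # shape: (num_scales, len(signal))

# usage: a 10 Hz tone sampled at 128 Hz for 2 seconds
fs = 128.0
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)
scalogram = cwt_scalogram(x, sampling_period=1 / fs)
print(scalogram.shape)  # (64, 256)
```

The resulting magnitude map is image-shaped, so it can be consumed by ordinary 2D convolutional machinery, which is presumably what lets the "TC" stream stay fast and efficient.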