Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human
Action Recognition
- URL: http://arxiv.org/abs/2007.11365v1
- Date: Wed, 22 Jul 2020 12:26:04 GMT
- Title: Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human
Action Recognition
- Authors: Sudhakar Kumawat, Manisha Verma, Yuta Nakashima, and Shanmuganathan
Raman
- Abstract summary: Conventional 3D convolutional neural networks (CNNs) are computationally expensive, memory intensive, prone to overfitting, and, most importantly, in need of improved feature learning capabilities.
We propose a new class of convolutional blocks that can serve as an alternative to the 3D convolutional layer and its variants in 3D CNNs.
Our evaluation on seven action recognition datasets, including Something-something v1 and v2, Jester, Diving-48, Kinetics-400, UCF 101, and HMDB 51, demonstrates that STFT-block-based 3D CNNs achieve on par or even better performance compared to the state-of-the-art methods.
- Score: 42.400429835080416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional 3D convolutional neural networks (CNNs) are computationally
expensive, memory intensive, prone to overfitting, and most importantly, there
is a need to improve their feature learning capabilities. To address these
issues, we propose spatio-temporal short term Fourier transform (STFT) blocks,
a new class of convolutional blocks that can serve as an alternative to the 3D
convolutional layer and its variants in 3D CNNs. An STFT block consists of
non-trainable convolution layers that capture spatially and/or temporally local
Fourier information using an STFT kernel at multiple low-frequency points,
followed by a set of trainable linear weights for learning channel
correlations. The STFT blocks significantly reduce the space-time complexity in
3D CNNs. In general, they use 3.5 to 4.5 times fewer parameters and incur 1.5 to 1.8
times lower computational cost than the state-of-the-art methods.
Furthermore, their feature learning capabilities are significantly better than
the conventional 3D convolutional layer and its variants. Our extensive
evaluation on seven action recognition datasets, including Something-something
v1 and v2, Jester, Diving-48, Kinetics-400, UCF 101, and HMDB 51, demonstrates
that STFT-block-based 3D CNNs achieve on par or even better performance
compared to the state-of-the-art methods.
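The block described in the abstract admits a compact sketch. The following PyTorch snippet is a minimal illustration, assuming fixed local-DFT (STFT) filters at a handful of low-frequency points applied depthwise, followed by a trainable 1x1x1 convolution that learns channel correlations; the choice of frequency points, the kernel construction, and the exact depthwise arrangement here are assumptions and may differ from the paper's implementation.

```python
# Minimal sketch of an STFT-style block: non-trainable local-DFT filters applied
# depthwise, followed by a trainable 1x1x1 convolution for channel mixing.
# Illustrative only; not the authors' exact implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class STFTBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, num_freqs=4):
        super().__init__()
        k = kernel_size
        self.in_channels = in_channels
        self.padding = k // 2
        # Local window coordinates (t, y, x) of a k x k x k neighbourhood.
        grid = torch.stack(torch.meshgrid(
            torch.arange(k), torch.arange(k), torch.arange(k), indexing="ij"),
            dim=-1).float()                                   # (k, k, k, 3)
        # Hypothetical low-frequency points in [0, 0.5)^3 (the paper fixes these).
        freqs = torch.rand(num_freqs, 3) * 0.5                # (F, 3)
        phase = 2 * math.pi * torch.einsum("fd,tyxd->ftyx", freqs, grid)
        # Real and imaginary parts of the complex exponentials -> 2F fixed filters.
        filt = torch.cat([torch.cos(phase), torch.sin(phase)], dim=0)   # (2F, k, k, k)
        # Depthwise weight: every input channel is filtered by all 2F STFT kernels.
        weight = filt.unsqueeze(1).repeat(in_channels, 1, 1, 1, 1)      # (C*2F, 1, k, k, k)
        self.register_buffer("stft_weight", weight)           # non-trainable
        # Trainable pointwise (1x1x1) convolution for channel correlations.
        self.pointwise = nn.Conv3d(in_channels * 2 * num_freqs, out_channels, 1, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)

    def forward(self, x):                                     # x: (B, C, T, H, W)
        y = F.conv3d(x, self.stft_weight, padding=self.padding, groups=self.in_channels)
        return F.relu(self.bn(self.pointwise(y)))

# Usage: drop-in replacement for a 3x3x3 Conv3d layer.
block = STFTBlock(in_channels=64, out_channels=128)
out = block(torch.randn(2, 64, 8, 56, 56))                    # -> (2, 128, 8, 56, 56)
```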
Related papers
- Smaller3d: Smaller Models for 3D Semantic Segmentation Using Minkowski
Engine and Knowledge Distillation Methods [0.0]
This paper proposes the application of knowledge distillation techniques, especially for sparse tensors in 3D deep learning, to reduce model sizes while maintaining performance.
We analyze and propose different loss functions, including standard methods and combinations of various losses, to simulate the performance of state-of-the-art models of different Sparse Convolutional NNs.
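For orientation, a standard soft-target distillation loss (temperature-scaled KL divergence plus cross-entropy) is sketched below; it only illustrates the kind of objective such approaches combine, and the paper's actual loss terms for sparse 3D features may differ.

```python
# Generic knowledge-distillation loss sketch (standard soft-target KD).
# Shown for illustration; not the paper's specific loss combination.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)
    # Hard-target term: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```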
arXiv Detail & Related papers (2023-05-04T22:19:25Z) - NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
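As a loose sketch of that idea, the snippet below maps hash-encoded 3D coordinates to a scalar attenuation value with a small fully-connected network. It uses a simplified nearest-vertex hash lookup rather than NAF's actual multi-resolution encoder, and all layer sizes and constants are illustrative assumptions.

```python
# Loose sketch of a coordinate network for attenuation: hash-encoded 3D positions
# fed to a small MLP. Nearest-vertex hashing (no interpolation) and all sizes are
# simplifications for brevity, not NAF's actual encoder.
import torch
import torch.nn as nn

class HashMLP(nn.Module):
    def __init__(self, levels=4, feats=2, table_size=2**14, base_res=16, hidden=64):
        super().__init__()
        self.levels, self.table_size, self.base_res = levels, table_size, base_res
        self.tables = nn.Parameter(torch.empty(levels, table_size, feats).uniform_(-1e-4, 1e-4))
        self.mlp = nn.Sequential(
            nn.Linear(levels * feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                    # scalar attenuation coefficient

    def forward(self, xyz):                          # xyz: (N, 3) in [0, 1]^3
        enc = []
        for l in range(self.levels):
            res = self.base_res * (2 ** l)           # coarse-to-fine grid resolutions
            idx = (xyz * res).long().clamp_(0, res - 1)
            # Spatial hash of the enclosing grid vertex (standard large primes).
            h = (idx[:, 0] * 73856093) ^ (idx[:, 1] * 19349663) ^ (idx[:, 2] * 83492791)
            enc.append(self.tables[l][h % self.table_size])
        return self.mlp(torch.cat(enc, dim=-1))      # (N, 1)

# Usage: query attenuation at ray sample points.
net = HashMLP()
mu = net(torch.rand(1024, 3))                        # -> (1024, 1)
```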
arXiv Detail & Related papers (2022-09-29T04:06:00Z) - In Defense of Image Pre-Training for Spatiotemporal Recognition [32.56468478601864]
The key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features.
The new pipeline consistently achieves better results on video recognition with a significant speedup.
arXiv Detail & Related papers (2022-05-03T18:45:44Z) - Focal Sparse Convolutional Networks for 3D Object Detection [121.45950754511021]
We introduce two new modules to enhance the capability of Sparse CNNs.
They are focal sparse convolution (Focals Conv) and its multi-modal variant, focal sparse convolution with fusion.
For the first time, we show that spatially learnable sparsity in sparse convolution is essential for sophisticated 3D object detection.
arXiv Detail & Related papers (2022-04-26T17:34:10Z) - Gate-Shift-Fuse for Video Action Recognition [43.8525418821458]
Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
arXiv Detail & Related papers (2022-03-16T19:19:04Z) - CT-Net: Channel Tensorization Network for Video Classification [48.4482794950675]
3D convolution is powerful for video classification but often computationally expensive.
Most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency.
We propose a concise and novel Channel Tensorization Network (CT-Net).
Our CT-Net outperforms a number of recent SOTA approaches in terms of accuracy and/or efficiency.
arXiv Detail & Related papers (2021-06-03T05:35:43Z) - RANP: Resource Aware Neuron Pruning at Initialization for 3D CNNs [32.054160078692036]
We introduce a Resource Aware Neuron Pruning (RANP) algorithm that prunes 3D CNNs to high sparsity levels.
Our algorithm leads to roughly 50%-95% reduction in FLOPs and 35%-80% reduction in memory with negligible loss in accuracy compared to the unpruned networks.
arXiv Detail & Related papers (2021-02-09T04:35:29Z) - 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
The Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
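A crude sketch of the underlying idea, pooling temporally adjacent feature maps whose embeddings are highly similar, is shown below; the greedy thresholding rule, similarity measure, and non-differentiable grouping are simplifications and not the actual SGS module.

```python
# Crude sketch of similarity-guided temporal grouping: adjacent frames whose pooled
# features are very similar are averaged into one time step, reducing the temporal
# resolution that later 3D convolutions must process. Simplified, not the real SGS.
import torch
import torch.nn.functional as F

def group_similar_frames(x, threshold=0.95):
    # x: (B, C, T, H, W) feature maps; returns a list of per-sample (C, T_b', H, W)
    B, C, T, H, W = x.shape
    emb = x.mean(dim=(3, 4))                          # (B, C, T) global spatial pooling
    emb = F.normalize(emb, dim=1)                     # unit-norm per-frame embedding
    outputs = []
    for b in range(B):
        groups, current = [], [0]
        for t in range(1, T):
            sim = (emb[b, :, t] * emb[b, :, current[-1]]).sum()
            if sim > threshold:                       # similar enough: extend the group
                current.append(t)
            else:                                     # otherwise start a new group
                groups.append(current)
                current = [t]
        groups.append(current)
        # Average the frames inside each group -> fewer, adaptive time steps.
        pooled = torch.stack([x[b, :, g].mean(dim=1) for g in groups], dim=1)
        outputs.append(pooled)                        # (C, T_b', H, W)
    return outputs
```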
arXiv Detail & Related papers (2020-11-17T14:34:05Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high processing speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2 times faster inference, with a model size of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)