STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition
- URL: http://arxiv.org/abs/2003.08042v1
- Date: Wed, 18 Mar 2020 04:46:30 GMT
- Title: STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition
- Authors: Xu Li, Jingwen Wang, Lin Ma, Kaihao Zhang, Fengzong Lian, Zhanhui Kang
and Jinjun Wang
- Abstract summary: We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
- Score: 39.58542259261567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective and efficient spatio-temporal modeling is essential for action
recognition. Existing methods suffer from the trade-off between model
performance and model complexity. In this paper, we present a novel
Spatio-Temporal Hybrid Convolution Network (denoted as "STH") which
simultaneously encodes spatial and temporal video information with a small
parameter cost. Different from existing works that sequentially or parallelly
extract spatial and temporal information with different convolutional layers,
we divide the input channels into multiple groups and interleave the spatial
and temporal operations in one convolutional layer, which deeply incorporates
spatial and temporal clues. Such a design enables efficient spatio-temporal
modeling and maintains a small model scale. STH-Conv is a general building
block, which can be plugged into existing 2D CNN architectures such as ResNet
and MobileNet by replacing the conventional 2D-Conv blocks (2D convolutions).
STH network achieves competitive or even better performance than its
competitors on benchmark datasets such as Something-Something (V1 & V2),
Jester, and HMDB-51. Moreover, STH enjoys performance superiority over 3D CNNs
while maintaining an even smaller parameter cost than 2D CNNs.
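The core idea described in the abstract, dividing the input channels into groups and interleaving spatial and temporal operations within a single convolutional layer, can be sketched in a few lines of NumPy. This is a simplified depthwise illustration, not the paper's exact STH-Conv: the 3x3 spatial and 3-tap temporal kernels, the group size, and the rule assigning one temporal channel per group are assumptions made here for clarity.

```python
import numpy as np

def spatial_conv(x, k):
    # Depthwise 3x3 spatial convolution on one channel, zero-padded ("same").
    # x: (T, H, W), k: (3, 3)
    T, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[:, i:i + H, j:j + W]
    return out

def temporal_conv(x, k):
    # Depthwise 3-tap temporal convolution on one channel, zero-padded in time.
    # x: (T, H, W), k: (3,)
    T, H, W = x.shape
    xp = np.pad(x, ((1, 1), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for t in range(3):
        out += k[t] * xp[t:t + T]
    return out

def sth_conv(x, spatial_kernels, temporal_kernels, group_size=4):
    # x: (C, T, H, W). Within each group of `group_size` channels, the first
    # channel receives a temporal kernel and the rest receive spatial kernels,
    # so both operations are interleaved inside one convolutional layer.
    C = x.shape[0]
    out = np.empty_like(x)
    for c in range(C):
        if c % group_size == 0:
            out[c] = temporal_conv(x[c], temporal_kernels[c // group_size])
        else:
            out[c] = spatial_conv(x[c], spatial_kernels[c])
    return out
```

With identity kernels (a single center tap set to 1) the layer passes the input through unchanged, which makes the channel wiring easy to verify before swapping in learned weights.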
Related papers
- Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments.
We propose a novel generic representation called Structured Point Cloud Videos (SPCVs).
SPCVs re-organize a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
arXiv Detail & Related papers (2024-03-02T08:18:57Z)
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
- Spatiotemporal Modeling Encounters 3D Medical Image Analysis: Slice-Shift UNet with Multi-View Fusion [0.0]
We propose a new 2D-based model dubbed Slice SHift UNet which encodes three-dimensional features at 2D CNN's complexity.
More precisely, multi-view features are collaboratively learned by performing 2D convolutions along the three planes of a volume.
The effectiveness of our approach is validated on the Multi-Modality Abdominal Multi-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) datasets.
arXiv Detail & Related papers (2023-07-24T14:53:23Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification [12.787763599624173]
We propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and depthwise factorized component, D(2+1)D.
Thanks to the efficiency and effectiveness of its temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation, surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics.
arXiv Detail & Related papers (2020-12-01T07:40:06Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
- Comparison of Spatiotemporal Networks for Learning Video Related Tasks [0.0]
Many methods for learning from sequences involve temporally processing 2D CNN features from the individual frames or directly utilizing 3D convolutions within high-performing 2D CNN architectures.
This work constructs an MNIST-based video dataset with parameters controlling relevant facets of common video-related tasks: classification, ordering, and speed estimation.
Models trained on this dataset are shown to differ in key ways depending on the task and their use of 2D convolutions, 3D convolutions, or convolutional LSTMs.
arXiv Detail & Related papers (2020-09-15T19:57:50Z)
- MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition [16.067602635607965]
MixTConv consists of multiple depthwise 1D convolutional filters with different kernel sizes.
We propose an efficient and effective network architecture named MSTNet for action recognition, achieving state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2020-01-19T04:21:51Z)
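The MixTConv summary above describes multiple depthwise 1D temporal convolutions with different kernel sizes applied to different channel groups. A minimal NumPy sketch of that idea follows; the kernel sizes, the even channel split, and the use of simple averaging filters in place of learned kernels are assumptions made here for illustration.

```python
import numpy as np

def depthwise_temporal_conv(x, kernel):
    # 1D convolution along time, applied independently per channel,
    # zero-padded so the output has the same length ("same" padding).
    # x: (T, C), kernel: (k,) with k odd
    T, C = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        out += kernel[i] * xp[i:i + T]
    return out

def mix_tconv(x, kernel_sizes=(1, 3, 5, 7)):
    # Split channels into one group per kernel size and apply a depthwise
    # temporal convolution of that size to each group. Averaging kernels
    # stand in for the learned filters of the actual MixTConv module.
    T, C = x.shape
    groups = np.array_split(np.arange(C), len(kernel_sizes))
    out = np.empty_like(x)
    for idx, ks in zip(groups, kernel_sizes):
        kernel = np.full(ks, 1.0 / ks)
        out[:, idx] = depthwise_temporal_conv(x[:, idx], kernel)
    return out
```

Because each group uses a different temporal receptive field, the mixed output captures motion cues at several time scales in a single lightweight layer.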