Gate-Shift-Fuse for Video Action Recognition
- URL: http://arxiv.org/abs/2203.08897v3
- Date: Sat, 15 Apr 2023 13:06:27 GMT
- Title: Gate-Shift-Fuse for Video Action Recognition
- Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
- Abstract summary: Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient, high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
- Score: 43.8525418821458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional Neural Networks are the de facto models for image recognition.
However 3D CNNs, the straightforward extension of 2D CNNs for video
recognition, have not achieved the same success on standard action recognition
benchmarks. One of the main reasons for this reduced performance of 3D CNNs is
their increased computational complexity, which requires large-scale annotated
datasets to train them at scale. 3D kernel factorization approaches have been proposed
to reduce the complexity of 3D CNNs. Existing kernel factorization approaches
follow hand-designed and hard-wired techniques. In this paper we propose
Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which
controls interactions in spatio-temporal decomposition and learns to adaptively
route features through time and combine them in a data dependent manner. GSF
leverages grouped spatial gating to decompose input tensor and channel
weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D
CNNs to convert them into an efficient and high performing spatio-temporal
feature extractor, with negligible parameter and compute overhead. We perform
an extensive analysis of GSF using two popular 2D CNN families and achieve
state-of-the-art or competitive performance on five standard action recognition
benchmarks.
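The abstract outlines the mechanism: grouped spatial gating decomposes the input tensor, the decomposed parts are routed through time, and channel weighting fuses them back. A minimal NumPy sketch of this gate-shift-fuse pattern follows; the shapes, the sigmoid gates, and the half-forward/half-backward shift are illustrative assumptions, not the authors' implementation (in the real module the gates are data dependent, computed from the input by small learned layers):

```python
import numpy as np

def gate_shift_fuse(x, w_gate, w_fuse):
    """Illustrative GSF sketch: gate, shift along time, fuse.

    x      : (T, C, H, W) feature tensor from a 2D CNN layer
    w_gate : (C,) per-channel gating parameters (learned in the real model)
    w_fuse : (C,) per-channel fusion parameters (learned in the real model)
    """
    # 1. Gating: a per-channel gate in (0, 1) splits the tensor into a
    #    part that will be shifted in time and a residual part.
    gate = 1.0 / (1.0 + np.exp(-w_gate))            # sigmoid, shape (C,)
    gated = x * gate[None, :, None, None]
    residual = x - gated

    # 2. Temporal shift: move half the gated channels one step forward
    #    in time and the other half one step backward (zero-padded).
    shifted = np.zeros_like(gated)
    half = gated.shape[1] // 2
    shifted[1:, :half] = gated[:-1, :half]          # forward shift
    shifted[:-1, half:] = gated[1:, half:]          # backward shift

    # 3. Channel-weighted fusion of shifted and residual features.
    fuse = 1.0 / (1.0 + np.exp(-w_fuse))            # sigmoid, shape (C,)
    return fuse[None, :, None, None] * shifted + \
           (1.0 - fuse)[None, :, None, None] * residual
```

Because the output has the same shape as the input, a block like this can be dropped after any 2D convolution layer without changing the rest of the network, which is what makes the "insert into existing 2D CNNs" claim cheap in parameters.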
Related papers
- Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video
Recognition [25.364148451584356]
3D convolution neural networks (CNNs) have been the prevailing option for video recognition.
We propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach.
Experiments on Something-Something V1&V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-03-05T15:11:53Z) - Keypoint Message Passing for Video-based Person Re-Identification [106.41022426556776]
Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras.
Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement.
In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph.
arXiv Detail & Related papers (2021-11-16T08:01:16Z) - Continual 3D Convolutional Neural Networks for Real-time Processing of
Videos [93.73198973454944]
We introduce Continual 3D Convolutional Neural Networks (Co3D CNNs).
Co3D CNNs process videos frame-by-frame rather than clip-by-clip.
We show that Co3D CNNs initialised with the weights of preexisting state-of-the-art video recognition models reduce floating point operations for frame-wise computations by 10.0-12.4x while improving accuracy on Kinetics-400 by 2.3-3.8%.
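The frame-wise savings come from caching: once per-frame features are kept in a buffer, a temporal convolution only needs one new weighted sum per incoming frame instead of reprocessing the whole clip. A hypothetical sketch of that idea (not the Co3D CNN implementation):

```python
import numpy as np
from collections import deque

class ContinualTemporalConv:
    """Illustrative frame-by-frame temporal convolution: cache the last
    k frame feature maps so each new frame costs one weighted sum,
    rather than re-running a 3D convolution over the entire clip."""

    def __init__(self, kernel):
        # kernel: (k,) temporal weights; deque(maxlen=k) drops the
        # oldest frame automatically as new ones arrive.
        self.kernel = np.asarray(kernel)
        self.buffer = deque(maxlen=len(self.kernel))

    def step(self, frame_feat):
        """Feed one frame's feature map; returns None until the
        temporal receptive field has filled up."""
        self.buffer.append(frame_feat)
        if len(self.buffer) < len(self.kernel):
            return None
        # Weighted sum over the cached window = temporal convolution
        # evaluated at the current time step only.
        return sum(w * f for w, f in zip(self.kernel, self.buffer))
```

Each call touches k cached feature maps once, which is where the order-of-magnitude FLOP reduction for streaming inference comes from.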
arXiv Detail & Related papers (2021-05-31T18:30:52Z) - 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
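The summary above describes grouping similar temporal features to cut computation. A toy sketch of similarity-guided grouping, where the cosine threshold and the averaging rule are assumptions for illustration rather than the paper's exact formulation:

```python
import numpy as np

def similarity_guided_sampling(feats, threshold=0.98):
    """Merge temporally adjacent, near-duplicate frame features so
    later layers process fewer time steps.

    feats : (T, D) per-frame feature vectors
    returns: (T', D) with T' <= T, one averaged vector per group
    """
    groups = [[feats[0]]]
    for f in feats[1:]:
        prev = groups[-1][-1]
        cos = np.dot(prev.ravel(), f.ravel()) / (
            np.linalg.norm(prev) * np.linalg.norm(f) + 1e-8)
        if cos >= threshold:
            groups[-1].append(f)        # near-duplicate frame: same group
        else:
            groups.append([f])          # distinct frame: new group
    # Each group is collapsed into one representative feature vector.
    return np.stack([np.mean(g, axis=0) for g in groups])
```

In a static scene most adjacent frames collapse into one group, so the effective temporal resolution adapts to the input, which is the intuition behind the reported halving of GFLOPs.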
arXiv Detail & Related papers (2020-11-17T14:34:05Z) - Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture, that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms existing state-of-the-arts by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z) - Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human
Action Recognition [42.400429835080416]
Conventional 3D convolutional neural networks (CNNs) are computationally expensive, memory intensive, prone to overfitting and most importantly, there is a need to improve their feature learning capabilities.
We propose new class of convolutional blocks that can serve as an alternative to 3D convolutional layer and its variants in 3D CNNs.
Our evaluation on seven action recognition datasets, including Something-Something v1 and v2, Jester, Diving, Kinetics-400, UCF 101, and HMDB 51, demonstrates that STFT-block-based 3D CNNs achieve on-par or even better performance compared to the state-of-the-art.
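The core idea of replacing learned 3D temporal kernels with a fixed frequency transform can be illustrated with a plain short-time Fourier transform over a feature channel's temporal axis; the window and hop sizes here are arbitrary choices, not the paper's configuration:

```python
import numpy as np

def temporal_stft_features(series, win=4, hop=2):
    """Illustrative STFT over the temporal axis of one feature channel:
    frequency magnitudes stand in for learned temporal kernels.

    series : (T,) activations of one channel at one spatial location
    returns: (num_windows, win // 2 + 1) magnitude spectrogram
    """
    # Sliding windows over time, advancing by `hop` frames.
    frames = [series[s:s + win]
              for s in range(0, len(series) - win + 1, hop)]
    # Real FFT of each window; the magnitude spectrum is the feature.
    return np.abs(np.stack([np.fft.rfft(f) for f in frames]))
```

Because the transform has no learned parameters, swapping it in for a 3D temporal kernel is one way such blocks can reduce parameter count and overfitting, as the summary argues.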
arXiv Detail & Related papers (2020-07-22T12:26:04Z) - RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks
on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobiles.
arXiv Detail & Related papers (2020-07-20T02:05:32Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model smaller than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z) - A Fast 3D CNN for Hyperspectral Image Classification [0.456877715768796]
Hyperspectral imaging (HSI) has been extensively utilized for a number of real-world applications.
Hyperspectral image classification (HSIC) highly depends on both spectral and spatial information, which a 2D Convolutional Neural Network (CNN) alone cannot fully exploit.
This work proposed a 3D CNN model that utilizes both spatial-spectral feature maps to attain good performance.
arXiv Detail & Related papers (2020-04-29T12:57:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.