ACTION-Net: Multipath Excitation for Action Recognition
- URL: http://arxiv.org/abs/2103.07372v1
- Date: Thu, 11 Mar 2021 16:23:40 GMT
- Title: ACTION-Net: Multipath Excitation for Action Recognition
- Authors: Zhengwei Wang, Qi She, Aljosa Smolic
- Abstract summary: We equip 2D CNNs with the proposed ACTION module to form a simple yet effective ACTION-Net with very limited extra computational cost.
ACTION-Net is shown to consistently outperform its 2D CNN counterparts on three backbones.
- Score: 22.12530692711095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial-temporal, channel-wise, and motion patterns are three complementary
and crucial types of information for video action recognition. Conventional 2D
CNNs are computationally cheap but cannot capture temporal relationships; 3D CNNs
can achieve good performance but are computationally intensive. In this work,
we tackle this dilemma by designing a generic and effective module that can be
embedded into 2D CNNs. To this end, we propose a spAtio-temporal, Channel and
moTion excitatION (ACTION) module consisting of three paths: Spatio-Temporal
Excitation (STE) path, Channel Excitation (CE) path, and Motion Excitation (ME)
path. The STE path employs a single-channel 3D convolution to characterize the
spatio-temporal representation. The CE path adaptively recalibrates
channel-wise feature responses by explicitly modeling interdependencies between
channels along the temporal dimension. The ME path calculates feature-level
temporal differences, which are then utilized to excite motion-sensitive
channels. We equip 2D CNNs with the proposed ACTION module to form a simple yet
effective ACTION-Net with very limited extra computational cost. ACTION-Net is
shown to consistently outperform its 2D CNN counterparts on three backbones
(i.e., ResNet-50, MobileNet V2, and BNInception) on three datasets (i.e.,
Something-Something V2, Jester, and EgoGesture). Code is available at
https://github.com/V-Sense/ACTION-Net.
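The three excitation paths can be pictured concretely. Below is a minimal PyTorch sketch of an ACTION-style block built only from the abstract's description of the STE, CE, and ME paths; the layer shapes, the channel-reduction ratio, the zero-padding of the last temporal difference, and the summation used to fuse the three paths are illustrative assumptions, not the authors' design (see https://github.com/V-Sense/ACTION-Net for the official implementation).

```python
import torch
import torch.nn as nn


class ActionSketch(nn.Module):
    """Illustrative STE / CE / ME excitation paths for (N*T, C, H, W) features."""

    def __init__(self, channels: int, n_segment: int, reduction: int = 16):
        super().__init__()
        self.n_segment = n_segment
        # STE: a single-channel 3D convolution over the (T, H, W) volume.
        self.ste_conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1, bias=False)
        # CE: squeeze channels, model temporal interdependencies, expand back.
        self.ce_squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.ce_temporal = nn.Conv1d(channels // reduction, channels // reduction,
                                     kernel_size=3, padding=1)
        self.ce_expand = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        # ME: reduce channels, take frame-to-frame differences, gate channels.
        self.me_reduce = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.me_expand = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        t = self.n_segment
        n = nt // t

        # Spatio-Temporal Excitation: average over channels, run a 3D conv,
        # and use the result as a per-location spatio-temporal attention map.
        ste = x.reshape(n, t, c, h, w).mean(dim=2, keepdim=True)    # (N, T, 1, H, W)
        ste = self.ste_conv3d(ste.transpose(1, 2)).transpose(1, 2)  # (N, T, 1, H, W)
        x_ste = x * self.sigmoid(ste).reshape(nt, 1, h, w)

        # Channel Excitation: global-average-pool spatially, mix information
        # across time with a 1D conv, and recalibrate each channel.
        ce = x.mean(dim=[2, 3], keepdim=True)                       # (N*T, C, 1, 1)
        ce = self.ce_squeeze(ce).reshape(n, t, -1).transpose(1, 2)  # (N, C/r, T)
        ce = self.ce_temporal(ce).transpose(1, 2).reshape(nt, -1, 1, 1)
        x_ce = x * self.sigmoid(self.ce_expand(ce))                 # (N*T, C, H, W)

        # Motion Excitation: feature-level temporal differences, pooled and
        # used to excite motion-sensitive channels.
        me = self.me_reduce(x).reshape(n, t, -1, h, w)              # (N, T, C/r, H, W)
        diff = me[:, 1:] - me[:, :-1]                               # adjacent-frame differences
        diff = torch.cat([diff, torch.zeros_like(me[:, :1])], dim=1)  # pad back to T steps
        diff = diff.reshape(nt, -1, h, w).mean(dim=[2, 3], keepdim=True)
        x_me = x * self.sigmoid(self.me_expand(diff))               # (N*T, C, 1, 1) gate

        # How the three paths are fused is an assumption here; summation keeps
        # the output shape identical to the input so the block stays drop-in.
        return x_ste + x_ce + x_me


if __name__ == "__main__":
    # Example: 2 clips of 8 frames, 256 channels, 56x56 feature maps.
    block = ActionSketch(channels=256, n_segment=8)
    out = block(torch.randn(2 * 8, 256, 56, 56))
    print(out.shape)  # torch.Size([16, 256, 56, 56])
```

Because the sketch preserves the (N*T, C, H, W) layout that a frame-based 2D CNN already uses, such a block could in principle be dropped between stages of ResNet-50, MobileNet V2, or BNInception without changing the surrounding architecture, which is the kind of embedding the abstract describes.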
Related papers
- Blockwise Temporal-Spatial Pathway Network [0.2538209532048866]
We propose a 3D-CNN-based action recognition model, called the blockwise temporal-spatial pathway network (BTSNet).
The design is inspired by adaptive kernel selection models, which adaptively choose spatial receptive fields for image recognition.
For evaluation, we tested our proposed model on UCF-101, HMDB-51, SVW, and EpicKitchen datasets.
arXiv Detail & Related papers (2022-08-05T08:43:30Z)
- Gate-Shift-Fuse for Video Action Recognition [43.8525418821458]
Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient, high-performing spatio-temporal feature extractors with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
arXiv Detail & Related papers (2022-03-16T19:19:04Z)
- STSM: Spatio-Temporal Shift Module for Efficient Action Recognition [4.096670184726871]
We propose a plug-and-play Spatio-temporal Shift Module (STSM) that is both effective and high-performing.
In particular, when inserted into a 2D CNN, STSM allows the network to learn efficient spatio-temporal features.
arXiv Detail & Related papers (2021-12-05T09:40:49Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we dynamically generate spatio-temporal kernels of varying scales to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- MoViNets: Mobile Video Networks for Efficient Video Recognition [52.49314494202433]
3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets.
We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs.
arXiv Detail & Related papers (2021-03-21T23:06:38Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted into temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
- STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information at a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model size.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)