Learning Comprehensive Motion Representation for Action Recognition
- URL: http://arxiv.org/abs/2103.12278v1
- Date: Tue, 23 Mar 2021 03:06:26 GMT
- Title: Learning Comprehensive Motion Representation for Action Recognition
- Authors: Mingyu Wu, Boyuan Jiang, Donghao Luo, Junchi Yan, Yabiao Wang, Ying
Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Xiaokang Yang
- Abstract summary: 2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on regions containing the critical moving target, according to the point-to-point similarity between adjacent feature maps.
- Score: 124.65403098534266
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: For action recognition learning, 2D CNN-based methods are efficient but may
yield redundant features due to applying the same 2D convolution kernel to each
frame. Recent efforts attempt to capture motion information by establishing
inter-frame connections, while still suffering from a limited temporal receptive
field or high latency. Moreover, feature enhancement is often performed along
only the channel or the spatial dimension in action recognition. To address these
issues, we first devise a Channel-wise Motion Enhancement (CME) module to
adaptively emphasize the channels related to dynamic information with a
channel-wise gate vector. The channel gates generated by CME incorporate the
information from all the other frames in the video. We further propose a
Spatial-wise Motion Enhancement (SME) module to focus on the regions containing
the critical moving target, according to the point-to-point similarity between
adjacent feature maps. The intuition is that the background typically changes
more slowly than the motion area. Both CME and SME have clear physical
meaning in capturing action clues. By integrating the two modules into the
off-the-shelf 2D network, we finally obtain a Comprehensive Motion
Representation (CMR) learning method for action recognition, which achieves
competitive performance on Something-Something V1 & V2 and Kinetics-400. On the
temporal reasoning datasets Something-Something V1 and V2, our method
outperforms the current state-of-the-art by 2.3% and 1.9% when using 16 frames
as input, respectively.
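The abstract only describes CME and SME at a high level, so the following PyTorch-style sketch is just one plausible reading of the two modules: the tensor layout, the pooled cross-frame context, the bottleneck size (`reduction`), and the way point-to-point similarity is turned into an attention map are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of channel-wise and spatial-wise motion gating.
# Shapes, layer sizes, and the aggregation scheme are assumptions made for
# clarity; the paper's CME/SME modules may differ in their exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelMotionGate(nn.Module):
    """Gate each frame's channels using pooled context from all frames (CME-like)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):            # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        desc = x.mean(dim=(3, 4))    # (N, T, C): per-frame channel descriptors
        context = desc.mean(dim=1, keepdim=True).expand(-1, t, -1)  # context from the whole clip
        gate = torch.sigmoid(self.fc(torch.cat([desc, context], dim=-1)))  # (N, T, C)
        return x * gate.view(n, t, c, 1, 1)


class SpatialMotionGate(nn.Module):
    """Emphasize locations that change between adjacent frames (SME-like)."""

    def forward(self, x):            # x: (N, T, C, H, W)
        nxt = torch.cat([x[:, 1:], x[:, -1:]], dim=1)      # shift frames by one
        sim = F.cosine_similarity(x, nxt, dim=2)            # (N, T, H, W)
        attn = torch.sigmoid(1.0 - sim)                      # low similarity -> likely motion
        return x * attn.unsqueeze(2)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 14, 14)                   # (N, T, C, H, W)
    out = SpatialMotionGate()(ChannelMotionGate(64)(feats))
    print(out.shape)                                         # torch.Size([2, 8, 64, 14, 14])
```

In this reading, both gates are cheap element-wise reweightings of existing features, which is consistent with the abstract's claim that CME and SME can be inserted into an off-the-shelf 2D network to form the CMR model.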
Related papers
- Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention.
Our approach sets the new state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2023-06-13T06:56:09Z)
- Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement [28.570085937225976]
This paper proposes a novel framework, namely Attentive Cross-modal Interaction Network with Motion Enhancement.
It follows a two-stream architecture, i.e., one stream for the RGB modality and the other for the motion modality.
Experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.
arXiv Detail & Related papers (2022-05-07T06:26:49Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we use a Transformer to mine interactions among only a few selected foreground objects.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions.
arXiv Detail & Related papers (2021-06-02T11:43:49Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
- Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition [10.185425416255294]
We propose to use residual frames as an alternative "lightweight" motion representation.
We also develop a new pseudo-3D convolution module which decouples 3D convolution into 2D and 1D convolution (see the sketch after this list).
arXiv Detail & Related papers (2020-08-03T17:40:17Z)
- Temporal Distinct Representation Learning for Action Recognition [139.93983070642412]
A Two-Dimensional Convolutional Neural Network (2D CNN) is used to characterize videos.
Different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization.
We propose a sequential channel filtering mechanism to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction.
Our method is evaluated on benchmark temporal reasoning datasets Something-Something V1 and V2, and it achieves visible improvements over the best competitor by 2.4% and 1.3%, respectively.
arXiv Detail & Related papers (2020-07-15T11:30:40Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
- Actions as Moving Points [66.21507857877756]
We present a conceptually simple, efficient, and more precise action tubelet detection framework, termed MovingCenter Detector (MOC-detector).
Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches.
Our MOC-detector outperforms the existing state-of-the-art methods for both metrics of frame-mAP and video-mAP on the JHMDB and UCF101-24 datasets.
arXiv Detail & Related papers (2020-01-14T03:29:44Z)
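As referenced in the "Residual Frames with Efficient Pseudo-3D CNN" entry above, the following is a minimal sketch of the two ideas it mentions: residual (difference) frames as a lightweight motion cue, and a 3D convolution decoupled into a 2D spatial part and a 1D temporal part. The kernel sizes, channel counts, and the spatial-then-temporal ordering are assumptions; that paper's exact module may differ.

```python
# Minimal sketch: residual frames as a lightweight motion representation and a
# 3D convolution factorized into a 2D spatial and a 1D temporal convolution.
# Kernel sizes, channel counts, and factorization order are assumptions.
import torch
import torch.nn as nn


def residual_frames(clip):
    """Difference of adjacent frames as a lightweight motion representation."""
    # clip: (N, C, T, H, W); output has T-1 residual frames
    return clip[:, :, 1:] - clip[:, :, :-1]


class Pseudo3DConv(nn.Module):
    """3D convolution decoupled into spatial (1, k, k) and temporal (k, 1, 1) parts."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, (k, 1, 1), padding=(k // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):            # x: (N, C, T, H, W)
        return self.relu(self.temporal(self.relu(self.spatial(x))))


if __name__ == "__main__":
    clip = torch.randn(1, 3, 9, 56, 56)         # (N, C, T, H, W)
    motion = residual_frames(clip)               # (1, 3, 8, 56, 56)
    print(Pseudo3DConv(3, 32)(motion).shape)     # torch.Size([1, 32, 8, 56, 56])
```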