M2A: Motion Aware Attention for Accurate Video Action Recognition
- URL: http://arxiv.org/abs/2111.09976v1
- Date: Thu, 18 Nov 2021 23:38:09 GMT
- Title: M2A: Motion Aware Attention for Accurate Video Action Recognition
- Authors: Brennan Gebotys, Alexander Wong, David A. Clausi
- Abstract summary: We develop a new attention mechanism called Motion Aware Attention (M2A) that explicitly incorporates motion characteristics.
M2A extracts motion information between consecutive frames and utilizes attention to focus on the motion patterns found across frames to accurately recognize actions in videos.
We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to a +15% to +26% improvement in top-1 accuracy across different backbone architectures.
- Score: 86.67413715815744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in attention mechanisms have led to significant performance
improvements in a variety of areas in machine learning due to their ability to
enable the dynamic modeling of temporal sequences. A particular area in
computer vision that is likely to benefit greatly from the incorporation of
attention mechanisms is video action recognition. However, much of the current
research on attention mechanisms has focused on spatial and temporal
attention, which are unable to take advantage of the inherent motion found in
videos. Motivated by this, we develop a new attention mechanism called Motion
Aware Attention (M2A) that explicitly incorporates motion characteristics. More
specifically, M2A extracts motion information between consecutive frames and
utilizes attention to focus on the motion patterns found across frames to
accurately recognize actions in videos. The proposed M2A mechanism is simple to
implement and can be easily incorporated into any neural network backbone
architecture. We show that incorporating motion mechanisms with attention
mechanisms using the proposed M2A mechanism can lead to a +15% to +26%
improvement in top-1 accuracy across different backbone architectures, with
only a small increase in computational complexity. We further compared the
performance of M2A with other state-of-the-art motion and attention mechanisms
on the Something-Something V1 video action recognition benchmark. Experimental
results showed that M2A can lead to further improvements when combined with
other temporal mechanisms and that it outperforms other motion-only or
attention-only mechanisms by as much as +60% in top-1 accuracy for specific
classes in the benchmark.
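The abstract describes M2A at a high level: extract motion information between consecutive frames, then apply attention over the resulting motion patterns before fusing them back into the backbone features. No code accompanies this summary, so the following is only a minimal PyTorch sketch of that general idea, assuming frame differencing as the motion extractor and standard multi-head self-attention over the temporal dimension; the module name, tensor shapes, and hyperparameters are illustrative and not the authors' implementation.

```python
# Minimal sketch of a motion-aware attention block (illustrative only; not the
# authors' M2A implementation). Assumes per-frame features of shape
# (batch, time, channels) produced by any 2D backbone.
import torch
import torch.nn as nn


class MotionAwareAttentionSketch(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Attention over the temporal dimension of the motion features.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) per-frame features from the backbone.
        # Motion as the difference between consecutive frames, zero-padded so
        # the sequence length is preserved.
        motion = x[:, 1:] - x[:, :-1]
        motion = torch.cat([torch.zeros_like(x[:, :1]), motion], dim=1)

        # Attend over motion patterns across frames and fuse back residually.
        attended, _ = self.attn(motion, motion, motion)
        return x + self.norm(attended)


if __name__ == "__main__":
    frames = torch.randn(2, 8, 256)   # (batch, time, channels)
    block = MotionAwareAttentionSketch(256)
    print(block(frames).shape)        # torch.Size([2, 8, 256])
```

Because the block keeps the input shape, it could in principle be dropped between temporal stages of an existing 2D backbone, which matches the abstract's claim that the mechanism is simple to incorporate into any backbone architecture.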
Related papers
- MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms [12.621553130655945]
We develop a versatile set of simple yet effective motion editing methods via manipulating attention maps.
Our method enjoys good generation and editing ability with good explainability.
arXiv Detail & Related papers (2024-10-24T17:59:45Z) - Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions [0.0]
We focus on two forms of attention mechanisms: attention modules and self-attention.
Attention modules are used to reweight the features of each layer input tensor.
Self-Attention, originally proposed in the area of Natural Language Processing, makes it possible to relate all the items in an input sequence.
arXiv Detail & Related papers (2021-12-23T18:02:48Z) - Attention Mechanisms in Computer Vision: A Survey [75.6074182122423]
We provide a comprehensive review of various attention mechanisms in computer vision.
We categorize them according to approach, such as channel attention, spatial attention, temporal attention and branch attention.
We suggest future directions for attention mechanism research.
arXiv Detail & Related papers (2021-11-15T09:18:40Z) - MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints [70.76761166614511]
We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO).
Our MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems.
arXiv Detail & Related papers (2021-09-14T15:35:08Z) - Class Semantics-based Attention for Action Detection [10.69685258736244]
Action localization networks are often structured as a feature encoder sub-network and a localization sub-network.
We propose a novel attention mechanism, the Class Semantics-based Attention (CSA), that learns from the temporal distribution of semantics of action classes present in an input video.
Our attention mechanism outperforms prior self-attention modules such as squeeze-and-excitation in the action detection task.
arXiv Detail & Related papers (2021-09-06T17:22:46Z) - Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
The What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z) - Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention [7.967230034960396]
We evaluate whether various active-memory mechanisms could replace self-attention in a Transformer.
Experiments suggest that active-memory alone achieves comparable results to the self-attention mechanism for language modelling.
For some specific algorithmic tasks, active-memory mechanisms alone outperform both self-attention and a combination of the two.
arXiv Detail & Related papers (2019-12-27T02:01:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.