Knowing What, Where and When to Look: Efficient Video Action Modeling
with Attention
- URL: http://arxiv.org/abs/2004.01278v1
- Date: Thu, 2 Apr 2020 21:48:11 GMT
- Title: Knowing What, Where and When to Look: Efficient Video Action Modeling
with Attention
- Authors: Juan-Manuel Perez-Rua and Brais Martinez and Xiatian Zhu and Antoine
Toisoul and Victor Escorcia and Tao Xiang
- Abstract summary: Attentive video modeling is essential for action recognition in unconstrained videos.
What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
- Score: 84.83632045374155
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Attentive video modeling is essential for action recognition in unconstrained
videos due to their rich yet redundant information over space and time.
However, introducing attention in a deep neural network for action recognition
is challenging for two reasons. First, an effective attention module needs to
learn what (objects and their local motion patterns), where (spatially), and
when (temporally) to focus on. Second, a video attention module must be
efficient because existing action recognition models already suffer from high
computational cost. To address both challenges, a novel What-Where-When (W3)
video attention module is proposed. Departing from existing alternatives, our
W3 module models all three facets of video attention jointly. Crucially, it is
extremely efficient by factorizing the high-dimensional video feature data into
low-dimensional meaningful spaces (1D channel vector for `what' and 2D spatial
tensors for `where'), followed by lightweight temporal attention reasoning.
Extensive experiments show that our attention model brings significant
improvements to existing action recognition models, achieving new
state-of-the-art performance on a number of benchmarks.
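To make the factorization described in the abstract concrete, below is a minimal sketch (in PyTorch) of how a what/where/when attention of this kind could be wired onto backbone video features. The module name `W3StyleAttention`, the reduction ratio, kernel sizes and pooling choices are illustrative assumptions, not the authors' released W3 implementation, whose details may differ.

```python
# Minimal, illustrative sketch (not the authors' code) of the factorization idea:
# a 1D channel ("what") attention, a 2D spatial ("where") attention, and a
# lightweight temporal ("when") step applied to a video feature tensor.
import torch
import torch.nn as nn


class W3StyleAttention(nn.Module):
    """Hypothetical module following the what/where/when factorization."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # "What": squeeze spatial dims, attend over the 1D channel vector.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # "Where": squeeze channels, attend over the 2D spatial map.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # "When": lightweight temporal reasoning over per-frame descriptors.
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)

        # What: per-frame channel attention from globally pooled features.
        chan = frames.mean(dim=(2, 3))                      # (b*t, c)
        frames = frames * self.channel_fc(chan)[:, :, None, None]

        # Where: per-frame spatial attention from channel-pooled maps.
        spat = torch.cat(
            [frames.mean(dim=1, keepdim=True), frames.amax(dim=1, keepdim=True)], dim=1
        )                                                   # (b*t, 2, h, w)
        frames = frames * self.spatial_conv(spat)

        # When: temporal attention over the sequence of frame descriptors.
        x = frames.reshape(b, t, c, h, w)
        desc = x.mean(dim=(3, 4)).transpose(1, 2)           # (b, c, t)
        when = self.temporal_conv(desc).transpose(1, 2)     # (b, t, c)
        return x * when[:, :, :, None, None]


# Usage: refine features from any video backbone, e.g. a (2, 8, 256, 14, 14) clip.
feats = torch.randn(2, 8, 256, 14, 14)
out = W3StyleAttention(256)(feats)
assert out.shape == feats.shape
```

Because the channel and spatial branches act on pooled 1D and 2D statistics rather than on the full 3D feature volume, the extra cost stays small relative to the backbone, which is the efficiency argument the abstract makes.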
Related papers
- Flatten: Video Action Recognition is an Image Classification task [15.518011818978074]
A novel video representation architecture, Flatten, serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network.
Experiments on commonly used datasets demonstrate that embedding Flatten provides significant performance improvements over the original models.
arXiv Detail & Related papers (2024-08-17T14:59:58Z)
- Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention.
Our approach sets the new state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2023-06-13T06:56:09Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of whole frames across the entire video and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Efficient Spatialtemporal Context Modeling for Action Recognition [42.30158166919919]
We propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range contextual information in videos for action recognition.
At each time step, we model the relationships between points lying on the same line along the horizontal, vertical, and depth directions, which forms a 3D criss-cross structure (see the sketch after this list).
Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25% and 11% for video context modeling.
arXiv Detail & Related papers (2021-03-20T14:48:12Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
- A Comprehensive Study of Deep Video Action Recognition [35.7068977497202]
Video action recognition is one of the representative tasks for video understanding.
We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
arXiv Detail & Related papers (2020-12-11T18:54:08Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification [86.64702967379709]
We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell.
The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video accuracy by more than 2% on both Kinetics-600 and MiT datasets.
arXiv Detail & Related papers (2020-07-23T14:30:05Z)
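The 3D criss-cross structure mentioned in the RCCA-3D entry above can be pictured with a small toy computation: for each position, attention is restricted to positions that share its horizontal line, its vertical line, or its temporal line. The snippet below is only an illustrative sketch under that reading, not the RCCA-3D implementation; the tensor layout, scaling, and the naive loops are assumptions for clarity.

```python
# Toy sketch of a 3D criss-cross neighbourhood: for a position (t, y, x) in a
# (T, H, W) feature grid, attend only over positions sharing its horizontal,
# vertical, or temporal line (illustrative, not the RCCA-3D code).
import torch


def criss_cross_attend(feats: torch.Tensor) -> torch.Tensor:
    # feats: (T, H, W, C) features for one video clip.
    T, H, W, C = feats.shape
    out = torch.empty_like(feats)
    for t in range(T):
        for y in range(H):
            for x in range(W):
                q = feats[t, y, x]                                   # query, (C,)
                # Keys/values along the three axes through (t, y, x).
                neigh = torch.cat(
                    [feats[t, y, :], feats[t, :, x], feats[:, y, x]], dim=0
                )                                                    # (W+H+T, C)
                attn = torch.softmax(neigh @ q / C ** 0.5, dim=0)    # (W+H+T,)
                out[t, y, x] = attn @ neigh
    return out


# Applying the operation twice ("recurrently") lets information propagate
# beyond the criss-cross lines to the full 3D volume.
clip = torch.randn(4, 7, 7, 32)
refined = criss_cross_attend(criss_cross_attend(clip))
```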