AttentionNAS: Spatiotemporal Attention Cell Search for Video
Classification
- URL: http://arxiv.org/abs/2007.12034v2
- Date: Fri, 31 Jul 2020 04:25:23 GMT
- Title: AttentionNAS: Spatiotemporal Attention Cell Search for Video
Classification
- Authors: Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael
S. Ryoo, Anelia Angelova, Kris M. Kitani and Wei Hua
- Abstract summary: We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell.
The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets.
- Score: 86.64702967379709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional operations have two limitations: (1) they do not explicitly
model where to focus, as the same filter is applied to all positions, and (2) they are
unsuitable for modeling long-range dependencies, as they only operate on a small
neighborhood. While both limitations can be alleviated by attention operations,
many design choices remain to be determined to use attention, especially when
applying attention to videos. Towards a principled way of applying attention to
videos, we address the task of spatiotemporal attention cell search. We propose
a novel search space for spatiotemporal attention cells, which allows the
search algorithm to flexibly explore various design choices in the cell. The
discovered attention cells can be seamlessly inserted into existing backbone
networks, e.g., I3D or S3D, and improve video classification accuracy by more
than 2% on both Kinetics-600 and MiT datasets. The discovered attention cells
outperform non-local blocks on both datasets, and demonstrate strong
generalization across different modalities, backbones, and datasets. Inserting
our attention cells into I3D-R50 yields state-of-the-art performance on both
datasets.
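As a rough illustration of how such an attention cell can be dropped between stages of a 3D backbone, the sketch below implements a minimal non-local-style spatiotemporal self-attention block in PyTorch. The block structure, channel sizes, and reduction ratio are illustrative assumptions for this listing, not the searched cell design from the paper.
```python
import torch
import torch.nn as nn

class SpatiotemporalAttentionCell(nn.Module):
    """Toy non-local-style attention over all T*H*W positions.

    Illustrative stand-in for a searched attention cell, not the
    architecture discovered by AttentionNAS.
    """

    def __init__(self, channels, reduction=2):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv3d(channels, inner, kernel_size=1)
        self.key = nn.Conv3d(channels, inner, kernel_size=1)
        self.value = nn.Conv3d(channels, inner, kernel_size=1)
        self.proj = nn.Conv3d(inner, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        q = self.query(x).flatten(2)           # (B, C', THW)
        k = self.key(x).flatten(2)
        v = self.value(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k / k.shape[1] ** 0.5, dim=-1)
        out = (v @ attn.transpose(1, 2)).view(b, -1, t, h, w)
        return x + self.proj(out)              # residual, so the cell can be inserted anywhere

# Usage: insert after a backbone stage, e.g. features from an I3D/S3D block.
feats = torch.randn(2, 64, 8, 14, 14)          # (batch, channels, time, height, width)
cell = SpatiotemporalAttentionCell(64)
print(cell(feats).shape)                       # torch.Size([2, 64, 8, 14, 14])
```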
Related papers
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
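A minimal sketch of the general idea, fusing the three visual feature streams and matching them against a sentence embedding; the module name, fusion layer, and cosine-similarity scoring here are illustrative assumptions, not the MA3SRN architecture.
```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """Schematic fusion of motion / appearance / 3D object features for grounding."""

    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)    # concatenate the three streams, then project

    def forward(self, motion, appearance, object3d, sentence):
        # each visual stream: (num_segments, dim); sentence embedding: (dim,)
        v = self.fuse(torch.cat([motion, appearance, object3d], dim=-1))
        scores = torch.cosine_similarity(v, sentence.unsqueeze(0), dim=-1)
        return scores.argmax()                 # index of the best-matching segment

scorer = SegmentScorer()
m, a, o = (torch.randn(10, 256) for _ in range(3))
print(scorer(m, a, o, torch.randn(256)))
```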
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- VoxelEmbed: 3D Instance Segmentation and Tracking with Voxel Embedding based Deep Learning [5.434831972326107]
We propose a novel spatial-temporal voxel-embedding (VoxelEmbed) based learning method to perform simultaneous cell instance segmenting and tracking on 3D volumetric video sequences.
We evaluate our VoxelEmbed method on four 3D datasets (with different cell types) from the ISBI Cell Tracking Challenge.
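A simplified sketch of clustering per-voxel embeddings into temporally consistent instances; the greedy seed-based grouping and the distance threshold are illustrative assumptions, not VoxelEmbed's actual clustering procedure.
```python
import torch

def cluster_voxels(embeddings, foreground, threshold=0.5):
    """Greedily group foreground voxel embeddings into instances.

    embeddings: (T, H, W, D) tensor; foreground: (T, H, W) bool mask.
    Voxels of one cell are assumed to share similar embeddings across the
    whole volume, so a single label follows the cell through time
    (segmentation and tracking in one step).
    """
    labels = torch.zeros(foreground.shape, dtype=torch.long)
    next_id, coords = 1, foreground.nonzero()
    for t, h, w in coords.tolist():
        if labels[t, h, w] != 0:
            continue                            # voxel already assigned
        seed = embeddings[t, h, w]
        dist = (embeddings - seed).norm(dim=-1)
        member = foreground & (dist < threshold) & (labels == 0)
        labels[member] = next_id
        next_id += 1
    return labels

emb = torch.randn(4, 32, 32, 8)                # (time, height, width, embedding dim)
fg = emb[..., 0] > 1.5                         # toy foreground mask
print(cluster_voxels(emb, fg).max().item(), "instances")
```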
arXiv Detail & Related papers (2021-06-22T02:03:26Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework (CTL) to pursue a discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
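A toy sketch of one graph step over key-point features, with a topology that mixes fixed skeleton connections and a learned context term; the layer names, dimensions, and single-scale graph are illustrative assumptions, not the CTL implementation.
```python
import torch
import torch.nn as nn

class TopologyGraphLayer(nn.Module):
    """One graph-convolution step over body key-point features."""

    def __init__(self, num_joints, dim):
        super().__init__()
        self.context = nn.Parameter(torch.zeros(num_joints, num_joints))  # learned context
        self.proj = nn.Linear(dim, dim)

    def forward(self, joint_feats, physical_adj):
        # joint_feats: (B, num_joints, dim); physical_adj: (num_joints, num_joints)
        adj = torch.softmax(physical_adj + self.context, dim=-1)
        return torch.relu(self.proj(adj @ joint_feats))

layer = TopologyGraphLayer(num_joints=17, dim=128)
feats = torch.randn(2, 17, 128)                # local features pooled around 17 key-points
adj = torch.eye(17)                            # placeholder for the skeleton adjacency
print(layer(feats, adj).shape)                 # torch.Size([2, 17, 128])
```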
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Coordinate Attention for Efficient Mobile Network Design [96.40415345942186]
We propose a novel attention mechanism for mobile networks by embedding positional information into channel attention.
Unlike channel attention, which transforms a feature tensor into a single feature vector via 2D global pooling, coordinate attention factorizes channel attention into two 1D feature encoding processes.
Our coordinate attention is beneficial to ImageNet classification and performs better in downstream tasks, such as object detection and semantic segmentation.
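A minimal PyTorch sketch of the coordinate-attention idea described above (height-wise and width-wise pooling instead of a single 2D global pool); the reduction ratio and activation choices are illustrative.
```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Two 1D (height-wise and width-wise) encodings instead of one pooled vector."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.attn_h = nn.Conv2d(mid, channels, 1)
        self.attn_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        pooled_h = x.mean(dim=3, keepdim=True)                        # (B, C, H, 1)
        pooled_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (B, C, W, 1)
        y = self.shared(torch.cat([pooled_h, pooled_w], dim=2))       # (B, mid, H+W, 1)
        yh, yw = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.attn_h(yh))                          # (B, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(yw)).permute(0, 1, 3, 2)      # (B, C, 1, W)
        return x * a_h * a_w                   # position-aware channel reweighting

ca = CoordinateAttention(64)
print(ca(torch.randn(2, 64, 56, 56)).shape)    # torch.Size([2, 64, 56, 56])
```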
arXiv Detail & Related papers (2021-03-04T09:18:02Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
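A toy version of such statistical regression targets, assuming grayscale clips; the grid partition and the centroid-shift direction proxy are illustrative simplifications, not the statistics defined in the paper.
```python
import numpy as np

def motion_statistics(frames, grid=4):
    """Compute toy pretext-task targets from a (T, H, W) grayscale clip.

    Returns the grid cell with the largest frame-difference energy and a crude
    dominant-direction proxy (the mean shift of the intensity centroid).
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))   # (T-1, H, W)
    t, h, w = diffs.shape
    bh, bw = h // grid, w // grid
    energy = diffs[:, :grid * bh, :grid * bw].reshape(t, grid, bh, grid, bw).sum(axis=(0, 2, 4))
    cell = np.unravel_index(energy.argmax(), energy.shape)       # (row, col) of largest motion

    ys, xs = np.mgrid[0:h, 0:w]
    mass = frames.reshape(frames.shape[0], -1).sum(axis=1) + 1e-6
    cy = (frames * ys).reshape(frames.shape[0], -1).sum(axis=1) / mass
    cx = (frames * xs).reshape(frames.shape[0], -1).sum(axis=1) / mass
    direction = np.arctan2(np.diff(cy).mean(), np.diff(cx).mean())  # radians
    return cell, direction

clip = np.random.rand(8, 64, 64)
print(motion_statistics(clip))
```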
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Cell Segmentation and Tracking using CNN-Based Distance Predictions and a Graph-Based Matching Strategy [0.20999222360659608]
We present a method for the segmentation of touching cells in microscopy images.
By using a novel representation of cell borders, inspired by distance maps, our method is able to utilize not only touching cells but also close cells in the training process.
This representation is notably robust to annotation errors and shows promising results for the segmentation of microscopy images containing cell types that are underrepresented in, or missing from, the training data.
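A generic sketch of turning predicted distance maps into instance labels with a seeded watershed (using SciPy and scikit-image); the thresholds and post-processing are illustrative assumptions, and the graph-based matching over time is omitted.
```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def instances_from_distance_maps(cell_dist, neighbor_dist, seed_thresh=0.7, fg_thresh=0.1):
    """Turn predicted distance maps into instance labels.

    cell_dist: per-pixel normalized distance to the own cell border;
    neighbor_dist: map that is high near touching neighbors.
    """
    seeds, _ = ndimage.label(cell_dist > seed_thresh)        # one marker per cell interior
    foreground = cell_dist > fg_thresh
    # Flood from the seeds over an "elevation" that rises toward borders/neighbors.
    elevation = neighbor_dist - cell_dist
    return watershed(elevation, markers=seeds, mask=foreground)

# Toy example with two blobs standing in for network predictions.
yy, xx = np.mgrid[0:64, 0:64]
cell = np.maximum(np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / 50.0),
                  np.exp(-((yy - 40) ** 2 + (xx - 44) ** 2) / 50.0))
labels = instances_from_distance_maps(cell, np.zeros_like(cell))
print(labels.max(), "cells found")
```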
arXiv Detail & Related papers (2020-04-03T11:55:28Z)
- Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
The proposed What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
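A loose sketch of gating video features along the "what" (channel), "where" (spatial), and "when" (temporal) facets; the factorized gates below are illustrative assumptions, not the actual W3 module.
```python
import torch
import torch.nn as nn

class WhatWhereWhenAttention(nn.Module):
    """Channel ('what'), spatial ('where') and temporal ('when') gates on video features."""

    def __init__(self, channels):
        super().__init__()
        self.what = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.where = nn.Sequential(nn.Conv2d(1, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.when = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        per_frame = x.mean(dim=(3, 4)).transpose(1, 2)                   # (B, T, C)
        x = x * self.what(per_frame).transpose(1, 2)[..., None, None]    # channel gate
        spatial = x.mean(dim=1).reshape(b * t, 1, h, w)                  # channel-pooled maps
        x = x * self.where(spatial).reshape(b, 1, t, h, w)               # spatial gate
        x = x * self.when(per_frame).transpose(1, 2)[..., None, None]    # temporal gate
        return x

w3 = WhatWhereWhenAttention(64)
print(w3(torch.randn(2, 64, 8, 14, 14)).shape)  # torch.Size([2, 64, 8, 14, 14])
```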
arXiv Detail & Related papers (2020-04-02T21:48:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.