Attention in Attention: Modeling Context Correlation for Efficient Video
Classification
- URL: http://arxiv.org/abs/2204.09303v1
- Date: Wed, 20 Apr 2022 08:37:52 GMT
- Title: Attention in Attention: Modeling Context Correlation for Efficient Video
Classification
- Authors: Yanbin Hao, Shuo Wang, Pei Cao, Xinjian Gao, Tong Xu, Jinmeng Wu and
Xiangnan He
- Abstract summary: This paper proposes an efficient attention-in-attention (AIA) method for element-wise feature refinement.
We instantiate video feature contexts as dynamics aggregated along a specific axis with global average and max pooling operations.
All the computational operations in attention units act on the pooled dimension, which incurs only a negligible computational cost increase.
- Score: 47.938500236792244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention mechanisms have significantly boosted the performance of video
classification neural networks thanks to the utilization of perspective
contexts. However, the current research on video attention generally focuses on
adopting a specific aspect of contexts (e.g., channel, spatial/temporal, or
global context) to refine the features and neglects their underlying
correlation when computing attentions. This leads to incomplete context
utilization and hence limits the attainable performance improvement. To
tackle the problem, this paper proposes an efficient attention-in-attention
(AIA) method for element-wise feature refinement, which investigates the
feasibility of inserting the channel context into the spatio-temporal attention
learning module, referred to as CinST, and also its reverse variant, referred
to as STinC. Specifically, we instantiate the video feature contexts as
dynamics aggregated along a specific axis with global average and max pooling
operations. The workflow of an AIA module is that the first attention block
uses one kind of context information to guide the gating weights calculation of
the second attention that targets the other context. Moreover, all the
computational operations in attention units act on the pooled dimension, which
incurs only a negligible computational cost increase ($<$0.02\%). To verify our
method, we densely integrate it into two classical video network backbones and
conduct extensive experiments on several standard video classification
benchmarks. The source code of our AIA is available at
\url{https://github.com/haoyanbin918/Attention-in-Attention}.
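To make the AIA workflow concrete, below is a minimal PyTorch sketch of the CinST variant: the channel context (average- and max-pooled over space-time) gates the feature first, and the gated feature then drives a spatio-temporal attention map. The module structure, layer sizes, reduction ratio, and the way the two pooled contexts are fused are illustrative assumptions, not the authors' exact design; see the linked repository for the reference implementation.

```python
# A minimal, illustrative sketch of the CinST idea (channel context guiding
# spatio-temporal attention). Layer sizes, the reduction ratio, and the fusion
# of average/max contexts are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class CinST(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Inner (channel) attention: acts on the spatio-temporally pooled context.
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Outer (spatio-temporal) attention: acts on the channel-pooled maps.
        self.st_gate = nn.Sequential(
            nn.Conv3d(2, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) video feature
        n, c, t, h, w = x.shape
        # Channel context: global average + max pooling over (T, H, W).
        avg_c = x.mean(dim=(2, 3, 4))              # (N, C)
        max_c = x.amax(dim=(2, 3, 4))              # (N, C)
        ch_w = self.channel_gate(avg_c + max_c)    # (N, C) gating weights
        # The inner attention modulates the feature before the outer attention,
        # so the channel context "guides" the spatio-temporal gating.
        guided = x * ch_w.view(n, c, 1, 1, 1)
        # Spatio-temporal context: pool the guided feature along channels.
        avg_st = guided.mean(dim=1, keepdim=True)  # (N, 1, T, H, W)
        max_st = guided.amax(dim=1, keepdim=True)  # (N, 1, T, H, W)
        st_w = self.st_gate(torch.cat([avg_st, max_st], dim=1))
        return x * st_w                            # element-wise refinement
```

As a quick smoke test, `CinST(256)(torch.randn(2, 256, 8, 14, 14))` returns a tensor of the same shape, so such a module can be dropped after any 3D convolutional stage. Note that all learned layers operate on pooled tensors, which is why this style of attention adds so little compute relative to the backbone.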
Related papers
- HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group Activity
Scene Graph Generation in Videos [8.10024991952397]
Group Activity Scene Graph (GASG) generation is a challenging task in computer vision.
We introduce a GASG dataset extending the JRDB dataset with nuanced annotations involving Appearance, Interaction, Position, Relationship, and Situation attributes.
We also introduce an innovative approach, the Hierarchical Attention-Flow (HAtt-Flow) Mechanism, rooted in flow network theory to enhance GASG performance.
arXiv Detail & Related papers (2023-11-28T16:04:54Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Group Contextualization for Video Recognition [80.3842253625557]
Group contextualization (GC) can boost the performance of 2D-CNN (e.g., TSN) and TSM.
GC embeds feature with four different kinds of contexts in parallel.
Group contextualization can boost the performance of 2D-CNN (e.g., TSN) to a level comparable to the state-of-the-art video networks.
arXiv Detail & Related papers (2022-03-18T01:49:40Z)
- Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Towards Accurate RGB-D Saliency Detection with Complementary Attention
and Adaptive Integration [20.006932559837516]
Saliency detection based on the complementary information from RGB images and depth maps has recently gained great popularity.
We propose Complementary Attention and Adaptive Integration Network (CAAI-Net) to integrate complementary attention-based feature concentration and adaptive cross-modal feature fusion.
CAAI-Net is an effective saliency detection model and outperforms nine state-of-the-art models in terms of four widely-used metrics.
arXiv Detail & Related papers (2021-02-08T08:08:30Z)
- Channelized Axial Attention for Semantic Segmentation [70.14921019774793]
We propose the Channelized Axial Attention (CAA) to seamlessly integrate channel attention and axial attention with reduced computational complexity.
Our CAA not only requires far fewer computational resources than other dual attention models such as DANet, but also outperforms the state-of-the-art ResNet-101-based segmentation models on all tested datasets.
arXiv Detail & Related papers (2021-01-19T03:08:03Z)
- Region-based Non-local Operation for Video Classification [11.746833714322154]
This paper presents region-based non-local (RNL) operations as a family of self-attention mechanisms.
By combining a channel attention module with the proposed RNL, we design an attention chain, which can be integrated into off-the-shelf CNNs for end-to-end training (a generic sketch of such a chain appears after this list).
In experiments, our method outperforms other attention mechanisms, and we achieve state-of-the-art performance on the Something-Something V1 dataset.
arXiv Detail & Related papers (2020-07-17T14:57:05Z)
- See More, Know More: Unsupervised Video Object Segmentation with
Co-Attention Siamese Networks [184.4379622593225]
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task.
We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism.
We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
arXiv Detail & Related papers (2020-01-19T11:10:39Z)
- Hybrid Multiple Attention Network for Semantic Segmentation in Aerial
Images [24.35779077001839]
We propose a novel attention-based framework named Hybrid Multiple Attention Network (HMANet) to adaptively capture global correlations.
We introduce a simple yet effective region shuffle attention (RSA) module to reduce feature redundant and improve the efficiency of self-attention mechanism.
arXiv Detail & Related papers (2020-01-09T07:47:51Z)
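For readers who want a feel for the attention-chain idea mentioned in the region-based non-local entry above, here is a generic, hedged sketch: an SE-style channel gate followed by a plain (not region-restricted) non-local block applied per frame. All names and design choices here are illustrative assumptions rather than the RNL paper's actual architecture.

```python
# A generic "attention chain" sketch: SE-style channel attention followed by a
# standard non-local block. This is an assumption-based illustration, not the
# RNL paper's region-restricted design.
import torch
import torch.nn as nn

class AttentionChain(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        assert channels % 2 == 0, "channels must be even for the C/2 embedding"
        self.se = nn.Sequential(               # SE-style channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.theta = nn.Conv2d(channels, channels // 2, 1)  # query embedding
        self.phi = nn.Conv2d(channels, channels // 2, 1)    # key embedding
        self.g = nn.Conv2d(channels, channels // 2, 1)      # value embedding
        self.out = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) per-frame feature map
        x = x * self.se(x)                             # channel attention first
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (N, HW, C/2)
        k = self.phi(x).flatten(2)                     # (N, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (N, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)            # (N, HW, HW) affinities
        y = (attn @ v).transpose(1, 2).reshape(n, c // 2, h, w)
        return x + self.out(y)                         # residual, non-local style
```

The query/key/value 1x1 convolutions and the residual connection follow the standard non-local network recipe; restricting the softmax to spatial regions, as RNL proposes, would shrink the dense HW x HW attention to cheaper per-region attention.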