Class Semantics-based Attention for Action Detection
- URL: http://arxiv.org/abs/2109.02613v1
- Date: Mon, 6 Sep 2021 17:22:46 GMT
- Title: Class Semantics-based Attention for Action Detection
- Authors: Deepak Sridhar, Niamul Quader, Srikanth Muralidharan, Yaoxin Li, Peng Dai, Juwei Lu
- Abstract summary: Action localization networks are often structured as a feature encoder sub-network and a localization sub-network.
We propose a novel attention mechanism, the Class Semantics-based Attention (CSA), that learns from the temporal distribution of semantics of action classes present in an input video.
Our attention mechanism outperforms prior self-attention modules such as squeeze-and-excitation on the action detection task.
- Score: 10.69685258736244
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Action localization networks are often structured as a feature encoder
sub-network and a localization sub-network, where the feature encoder learns to
transform an input video to features that are useful for the localization
sub-network to generate reliable action proposals. While some of the encoded
features may be more useful for generating action proposals, prior action
localization approaches do not include an attention mechanism that enables the
localization sub-network to attend to the more important features. In this
paper, we propose a novel attention mechanism, the Class Semantics-based
Attention (CSA), that learns from the temporal distribution of semantics of
action classes present in an input video to find the importance scores of the
encoded features, which are used to provide attention to the more useful
encoded features. We demonstrate on two popular action detection datasets that
incorporating our novel attention mechanism provides considerable performance
gains on competitive action detection models (e.g., around a 6.2% improvement
over the BMN action detection baseline to obtain 47.5% mAP on the THUMOS-14
dataset), and a new state-of-the-art of 36.25% mAP on the ActivityNet v1.3
dataset. Further, the CSA localization model family, which includes BMN-CSA, was
part of the second-place submission at the 2021 ActivityNet action
localization challenge. Our attention mechanism outperforms prior
self-attention modules such as squeeze-and-excitation on the action detection
task. We also observe that our attention mechanism is complementary to such
self-attention modules in that performance improvements are seen when both are
used together.
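To make the mechanism concrete, here is a minimal PyTorch sketch of a class-semantics-driven channel gate. The abstract specifies only the interface (the temporal distribution of class semantics goes in, importance scores over encoded features come out), so the temporal pooling step, the two-layer gating network, and all tensor shapes below are illustrative assumptions, not the paper's CSA architecture:

```python
import torch
import torch.nn as nn

class ClassSemanticsAttention(nn.Module):
    """Illustrative sketch only: per-snippet class scores (the temporal
    distribution of class semantics) are pooled over time and mapped to
    channel importance scores that re-weight the encoded features."""

    def __init__(self, num_classes: int, channels: int):
        super().__init__()
        # Assumed gating sub-network; the paper's CSA design is not given here.
        self.gate = nn.Sequential(
            nn.Linear(num_classes, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, class_scores: torch.Tensor) -> torch.Tensor:
        # feats:        (batch, channels, time) encoded snippet features
        # class_scores: (batch, time, num_classes) per-snippet class probabilities
        semantics = class_scores.mean(dim=1)   # pool the temporal distribution
        weights = self.gate(semantics)         # (batch, channels) importance scores
        return feats * weights.unsqueeze(-1)   # attend to the more useful channels
```

The distinction the abstract draws is visible in the interface: squeeze-and-excitation derives its gate from the features themselves, whereas the gate here is driven by an external class-score stream, which is also why the two can be complementary when used together.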
Related papers
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationships among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z) - LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection [12.567069964305265]
This paper introduces a novel approach for high-quality deepfake detection called the Localized Artifact Attention Network (LAA-Net).
Existing methods for high-quality deepfake detection are mainly based on a supervised binary classifier coupled with an implicit attention mechanism.
Experiments performed on several benchmarks show the superiority of our approach in terms of Area Under the Curve (AUC) and Average Precision (AP).
arXiv Detail & Related papers (2024-01-24T23:42:08Z) - Learning to Refactor Action and Co-occurrence Features for Temporal
Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z) - TDAN: Top-Down Attention Networks for Enhanced Feature Selectivity in
CNNs [18.24779045808196]
We propose a lightweight top-down (TD) attention module that iteratively generates a "visual searchlight" to perform top-down channel and spatial modulation of its inputs.
Our models are more robust to changes in input resolution during inference and learn to "shift attention" by localizing individual objects or features at each computation step without any explicit supervision.
arXiv Detail & Related papers (2021-11-26T12:35:17Z) - Cross-modal Consensus Network for Weakly Supervised Temporal Action
Localization [74.34699679568818]
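The TDAN entry above describes its mechanism only as an iteratively generated "visual searchlight" that modulates channels and spatial positions. Below is a hedged PyTorch sketch of that idea, not the authors' module; the step count, gating layers, and class name are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TopDownSearchlight(nn.Module):
    """Hypothetical sketch: a gating signal derived from the current
    feature map modulates the same features channel-wise and spatially,
    repeated for a few iterative steps."""

    def __init__(self, channels: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.channel_gate = nn.Linear(channels, channels)
        self.spatial_gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        for _ in range(self.steps):
            ctx = x.mean(dim=(2, 3))                    # global context vector
            cg = torch.sigmoid(self.channel_gate(ctx))  # channel modulation
            sg = torch.sigmoid(self.spatial_gate(x))    # spatial modulation
            x = x * cg[:, :, None, None] * sg           # top-down re-weighting
        return x
```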
- Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z) - Action Unit Memory Network for Weakly Supervised Temporal Action
Localization [124.61981738536642]
Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training.
We present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization.
arXiv Detail & Related papers (2021-04-29T06:19:44Z) - Coordinate Attention for Efficient Mobile Network Design [96.40415345942186]
We propose a novel attention mechanism for mobile networks by embedding positional information into channel attention.
Unlike channel attention that transforms a feature tensor to a single feature vector via 2D global pooling, the coordinate attention factorizes channel attention into two 1D feature encoding processes.
Our coordinate attention is beneficial to ImageNet classification and behaves better in downstream tasks, such as object detection and semantic segmentation.
arXiv Detail & Related papers (2021-03-04T09:18:02Z) - One Point is All You Need: Directional Attention Point for Feature
Learning [51.44837108615402]
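The factorization described in the Coordinate Attention entry above is concrete enough to sketch: channel attention is split into two 1D encodings, one pooled along height and one along width, so the gate retains positional information. The following PyTorch sketch follows that description but simplifies details (e.g., the non-linearity and reduction ratio are assumptions); treat it as an illustration rather than the reference implementation:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of two-directional (coordinate) channel attention."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # 1D pooling along each spatial direction keeps positional info,
        # unlike the single vector produced by 2D global pooling.
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)                       # joint encoding
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = self.conv_h(y_h).sigmoid()                       # (n, c, h, 1)
        a_w = self.conv_w(y_w.permute(0, 1, 3, 2)).sigmoid()   # (n, c, 1, w)
        return x * a_h * a_w                                   # position-aware gate
```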
- One Point is All You Need: Directional Attention Point for Feature Learning [51.44837108615402]
We present a novel attention-based mechanism for learning enhanced point features for tasks such as point cloud classification and segmentation.
We show that our attention mechanism can be easily incorporated into state-of-the-art point cloud classification and segmentation networks.
arXiv Detail & Related papers (2020-12-11T11:45:39Z) - Attention-Guided Network for Iris Presentation Attack Detection [13.875545441867137]
We propose attention-guided iris presentation attack detection (AG-PAD) to augment CNNs with attention mechanisms.
Experiments involving both a JHU-APL proprietary dataset and the benchmark LivDet-Iris-2017 dataset suggest that the proposed method achieves promising results.
arXiv Detail & Related papers (2020-10-23T19:23:51Z) - Attention as Activation [4.265244011052538]
We propose a novel type of activation units called attentional activation (ATAC) units as a unification of activation functions and attention mechanisms.
By replacing the well-known rectified linear units with such ATAC units in convolutional networks, we can construct fully attentional networks that perform significantly better.
arXiv Detail & Related papers (2020-07-15T14:52:29Z)