A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization
- URL: http://arxiv.org/abs/2101.00545v3
- Date: Wed, 24 Mar 2021 23:15:35 GMT
- Title: A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization
- Authors: Ashraful Islam, Chengjiang Long, Richard Radke
- Abstract summary: We present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions.
Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset.
- Score: 12.353250130848044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly supervised temporal action localization is a challenging vision task
due to the absence of ground-truth temporal locations of actions in the
training videos. With only video-level supervision during training, most
existing methods rely on a Multiple Instance Learning (MIL) framework to
predict the start and end frame of each action category in a video. However,
existing MIL-based approaches have a major limitation: they capture only
the most discriminative frames of an action, ignoring the full extent of the
activity. Moreover, these methods cannot model background activity effectively,
which plays an important role in localizing foreground activities. In this
paper, we present a novel framework named HAM-Net with a hybrid attention
mechanism which includes temporal soft, semi-soft and hard attentions to
address these issues. Our temporal soft attention module, guided by an
auxiliary background class in the classification module, models the background
activity by introducing an "action-ness" score for each video snippet.
Moreover, our temporal semi-soft and hard attention modules, calculating two
attention scores for each video snippet, help to focus on the less
discriminative frames of an action to capture the full action boundary. Our
proposed approach outperforms recent state-of-the-art methods by at least 2.2%
mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at
IoU threshold 0.75 on the ActivityNet1.2 dataset. Code can be found at:
https://github.com/asrafulashiq/hamnet.
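The soft / semi-soft / hard attention split described in the abstract can be sketched as follows. This is an illustrative reading only, not the paper's implementation: the sigmoid scoring, the threshold value, and the rule of suppressing snippets above the threshold are all assumptions made for the sketch.

```python
import numpy as np

def hybrid_attention(snippet_logits, thresh=0.5):
    """Sketch of per-snippet soft, semi-soft and hard attention scores.

    soft      -- an "action-ness" score in [0, 1] for every video snippet
    semi_soft -- soft scores with the most discriminative snippets
                 (those above `thresh`) suppressed to zero, pushing the
                 classifier toward less discriminative snippets
    hard      -- a binary version of the same suppression mask
    """
    soft = 1.0 / (1.0 + np.exp(-np.asarray(snippet_logits, dtype=float)))
    semi_soft = np.where(soft > thresh, 0.0, soft)
    hard = np.where(soft > thresh, 0.0, 1.0)
    return soft, semi_soft, hard

# Three snippets: low, ambiguous, and highly discriminative logits.
soft, semi_soft, hard = hybrid_attention([-2.0, 0.0, 3.0])
```

The intuition matches the abstract: masking out the highest-scoring snippets forces the classification loss to be satisfied by the remaining, less discriminative parts of the action, widening the detected boundary.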
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- Progression-Guided Temporal Action Detection in Videos [20.02711550239915]
We present a novel framework, Action Progression Network (APN), for temporal action detection (TAD) in videos.
The framework locates actions in videos by detecting the action evolution process.
We quantify a complete action process into 101 ordered stages and train a neural network to recognize the action progressions.
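The 101-stage quantization described above can be sketched as a simple mapping from frame index to progression stage. The per-instance frame range and the rounding rule here are illustrative assumptions, not APN's exact scheme:

```python
def progression_labels(start_frame, end_frame, num_stages=101):
    """Map each frame of one action instance to an ordered progression
    stage: 0 at the action's start, num_stages - 1 (here 100) at its end.
    A network trained on these labels learns to recognize how far an
    action has evolved at every frame."""
    length = end_frame - start_frame
    return [round((f - start_frame) / length * (num_stages - 1))
            for f in range(start_frame, end_frame + 1)]

# An action instance spanning frames 10..20 yields stages 0, 10, ..., 100.
labels = progression_labels(10, 20)
```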
arXiv Detail & Related papers (2023-08-18T03:14:05Z)
- Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization [66.66545680550782]
We present a framework named FAC-Net with three appended branches: a class-wise foreground classification branch, a class-agnostic attention branch, and a multiple instance learning branch.
First, our class-wise foreground classification branch regularizes the relation between actions and foreground to maximize the foreground-background separation.
In addition, the class-agnostic attention branch and the multiple instance learning branch are adopted to regularize the foreground-action consistency and help learn a meaningful foreground.
arXiv Detail & Related papers (2021-08-14T12:34:44Z)
- Action Unit Memory Network for Weakly Supervised Temporal Action Localization [124.61981738536642]
Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training.
We present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization.
arXiv Detail & Related papers (2021-04-29T06:19:44Z)
- ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization [18.56421375743287]
We propose an action-context modeling network termed ACM-Net.
It integrates a three-branch attention module that simultaneously measures the likelihood of each temporal point being an action instance, context, or non-action background.
Our method can outperform current state-of-the-art methods, and even achieve comparable performance with fully-supervised methods.
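The three-branch likelihood can be read as a per-temporal-point softmax over action, context, and background scores. This minimal sketch assumes that reading and omits the actual network that produces the logits:

```python
import numpy as np

def three_way_attention(logits):
    """Normalize per-temporal-point (action, context, background) logits
    into a probability distribution via a numerically stable softmax.
    `logits` has shape (T, 3): one row of three scores per temporal point."""
    logits = np.asarray(logits, dtype=float)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Two temporal points: the first leans "action", the second "background".
probs = three_way_attention([[2.0, 0.5, -1.0], [-1.0, 0.0, 3.0]])
```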
arXiv Detail & Related papers (2021-04-07T07:39:57Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression.
Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
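A balanced binary cross-entropy term of the kind mentioned above can be sketched as follows. The balance-by-class-count scheme is an assumption for illustration; the paper's exact loss and its accompanying metric term are not reproduced here:

```python
import numpy as np

def balanced_bce(probs, labels, eps=1e-7):
    """Binary cross-entropy with the positive and negative terms each
    averaged over their own class, so the rarer class is not drowned
    out by the more frequent one."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels)
    pos, neg = labels == 1, labels == 0
    loss = 0.0
    if pos.any():
        loss += -np.log(probs[pos]).mean()      # positive-class term
    if neg.any():
        loss += -np.log(1.0 - probs[neg]).mean()  # negative-class term
    return loss

# Two positives predicted at 0.9 and 0.8, one negative predicted at 0.2.
loss = balanced_bce([0.9, 0.8, 0.2], [1, 1, 0])
```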
arXiv Detail & Related papers (2020-01-21T22:01:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.