Prior-enhanced Temporal Action Localization using Subject-aware Spatial
Attention
- URL: http://arxiv.org/abs/2211.05299v1
- Date: Thu, 10 Nov 2022 02:27:30 GMT
- Title: Prior-enhanced Temporal Action Localization using Subject-aware Spatial
Attention
- Authors: Yifan Liu and Youbao Tang and Ning Zhang and Ruei-Sung Lin and Haoqian
Wang
- Abstract summary: Temporal action localization (TAL) aims to detect the boundary and identify the class of each action instance in a long untrimmed video.
Current approaches treat video frames homogeneously, and tend to give background and key objects excessive attention.
We propose a prior-enhanced temporal action localization method (PETAL), which only takes in RGB input and incorporates action subjects as priors.
- Score: 26.74864808534721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization (TAL) aims to detect the boundary and identify
the class of each action instance in a long untrimmed video. Current approaches
treat video frames homogeneously, and tend to give background and key objects
excessive attention. This limits their sensitivity to localize action
boundaries. To this end, we propose a prior-enhanced temporal action
localization method (PETAL), which only takes in RGB input and incorporates
action subjects as priors. This proposal leverages action subjects' information
with a plug-and-play subject-aware spatial attention module (SA-SAM) to
generate an aggregated and subject-prioritized representation. Experimental
results on THUMOS-14 and ActivityNet-1.3 datasets demonstrate that the proposed
PETAL achieves competitive performance using only RGB features, e.g., boosting
mAP by 2.41% over the state-of-the-art approach that uses RGB features alone, and
by 0.25% over the approach that additionally uses optical flow features, on the
THUMOS-14 dataset.
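The abstract does not spell out SA-SAM's exact formulation. As a rough illustration of the general idea only — spatial attention re-weighted by a subject prior before aggregation — here is a minimal NumPy sketch; the norm-based scorer and the detection-heatmap prior are illustrative assumptions, not the paper's method:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def subject_aware_spatial_attention(frame_feats, subject_prior):
    """Aggregate per-frame spatial features into one vector, biasing
    attention toward subject regions.

    frame_feats:   (N, C) array - N spatial positions, C channels.
    subject_prior: (N,) array in [0, 1] - e.g. a flattened
                   person-detection heatmap (assumed prior source).
    """
    # Score each position; the feature norm stands in for a learned scorer.
    scores = np.linalg.norm(frame_feats, axis=1)
    # Add the log-prior so subject positions receive extra weight.
    eps = 1e-6
    scores = scores + np.log(subject_prior + eps)
    attn = softmax(scores)        # (N,) attention over spatial positions
    return attn @ frame_feats     # (C,) aggregated, subject-prioritized vector
```

With a prior that is zero everywhere except one position, the output collapses to roughly that position's feature vector, which is the intended subject-prioritization effect.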
Related papers
- Background Activation Suppression for Weakly Supervised Object
Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- Event-Free Moving Object Segmentation from Moving Ego Vehicle [88.33470650615162]
Moving object segmentation (MOS) in dynamic scenes is an important, challenging, but under-explored research topic for autonomous driving.
Most segmentation methods leverage motion cues obtained from optical flow maps.
We propose to exploit event cameras for better video understanding, which provide rich motion cues without relying on optical flow.
arXiv Detail & Related papers (2023-04-28T23:43:10Z)
- Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection [23.48709176879878]
Temporal action detection aims to predict the time intervals and the classes of action instances in the video.
Existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow.
We introduce a cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality.
arXiv Detail & Related papers (2023-03-30T10:47:26Z)
- Adaptive Multi-source Predictor for Zero-shot Video Object Segmentation [68.56443382421878]
We propose a novel adaptive multi-source predictor for zero-shot video object segmentation (ZVOS).
In the static object predictor, the RGB source is simultaneously converted to depth and static saliency sources.
Experiments show that the proposed model outperforms the state-of-the-art methods on three challenging ZVOS benchmarks.
arXiv Detail & Related papers (2023-03-18T10:19:29Z)
- Temporal Action Localization with Multi-temporal Scales [54.69057924183867]
We propose to predict actions on a feature space of multi-temporal scales.
Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales.
The proposed method can achieve improvements of 12.6%, 17.4% and 2.2%, respectively.
arXiv Detail & Related papers (2022-08-16T01:48:23Z)
- End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple end-to-end approach that effectively utilizes the unlabeled data.
Video action detection requires both action class prediction and spatio-temporal consistency.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
arXiv Detail & Related papers (2022-03-08T18:11:25Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly-supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization [12.353250130848044]
We present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions.
Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset.
arXiv Detail & Related papers (2021-01-03T03:08:18Z)
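The HAM-Net entry above mentions temporal soft, semi-soft, and hard attentions, but this summary does not give their definitions. A minimal NumPy sketch, assuming a sigmoid soft attention and a simple threshold for the semi-soft and hard variants (the exact scheme in HAM-Net may differ):

```python
import numpy as np

def hybrid_temporal_attention(logits, threshold=0.5):
    """Compute three temporal attention variants over per-snippet logits.

    logits: (T,) array of raw attention scores, one per video snippet.
    Returns (soft, semi_soft, hard), each of shape (T,).
    """
    logits = np.asarray(logits, dtype=float)
    soft = 1.0 / (1.0 + np.exp(-logits))                # soft: values in (0, 1)
    semi_soft = np.where(soft >= threshold, soft, 0.0)  # semi-soft: keep soft scores, zero out low ones
    hard = (soft >= threshold).astype(float)            # hard: binary foreground mask
    return soft, semi_soft, hard
```

The three masks trade off differently: soft attention keeps gradients for every snippet, while the semi-soft and hard variants suppress likely-background snippets more aggressively.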
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.