Progression-Guided Temporal Action Detection in Videos
- URL: http://arxiv.org/abs/2308.09268v1
- Date: Fri, 18 Aug 2023 03:14:05 GMT
- Title: Progression-Guided Temporal Action Detection in Videos
- Authors: Chongkai Lu, Man-Wai Mak, Ruimin Li, Zheru Chi, Hong Fu
- Abstract summary: We present a novel framework, Action Progression Network (APN), for temporal action detection (TAD) in videos.
The framework locates actions in videos by detecting the action evolution process.
We quantify a complete action process into 101 ordered stages and train a neural network to recognize the action progressions.
- Score: 20.02711550239915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel framework, Action Progression Network (APN), for temporal
action detection (TAD) in videos. The framework locates actions in videos by
detecting the action evolution process. To encode the action evolution, we
quantify a complete action process into 101 ordered stages (0%, 1%, ...,
100%), referred to as action progressions. We then train a neural network to
recognize the action progressions. The framework detects action boundaries by
detecting complete action processes in the videos, e.g., a video segment whose
detected action progressions closely follow the sequence 0%, 1%, ..., 100%.
The framework offers three major advantages: (1) Our neural networks are
trained end-to-end, contrasting conventional methods that optimize modules
separately; (2) The APN is trained using action frames exclusively, enabling
models to be trained on action classification datasets and to remain robust to
videos whose temporal background styles differ from those seen in training;
(3) Our framework effectively avoids detecting incomplete actions and excels in
detecting long-lasting actions, owing to its fine-grained and explicit encoding
of the temporal structure of actions. Leveraging these advantages, the APN
achieves competitive performance and significantly surpasses its counterparts
in detecting long-lasting actions. With an IoU threshold of 0.5, the APN
achieves a mean Average Precision (mAP) of 58.3% on the THUMOS14 dataset and
98.9% mAP on the DFMAD70 dataset.
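The boundary-detection idea described in the abstract can be sketched in a few lines. The following is an illustrative reconstruction, not the authors' code: it assumes per-frame progression predictions are already available, and it uses a simplifying heuristic (correlating candidate segments against an ideal 0%-to-100% ramp) in place of the paper's actual matching procedure.

```python
import numpy as np

def detect_complete_actions(progressions, start_thresh=5.0, end_thresh=95.0, min_corr=0.9):
    """Find segments whose per-frame action progressions (0-100) rise
    roughly linearly from ~0% to ~100%, i.e. complete action processes.

    progressions: 1-D array of predicted progression values, one per frame.
    Returns a list of (start_frame, end_frame) candidate detections.
    """
    progressions = np.asarray(progressions, dtype=float)
    n = len(progressions)
    starts = [i for i in range(n) if progressions[i] <= start_thresh]
    ends = [j for j in range(n) if progressions[j] >= end_thresh]
    detections = []
    for i in starts:
        for j in ends:
            if j <= i + 1:
                continue
            seg = progressions[i:j + 1]
            # A complete action should evolve steadily in time: compare the
            # segment against the ideal ramp 0..100 via Pearson correlation.
            ideal = np.linspace(0.0, 100.0, len(seg))
            if np.corrcoef(seg, ideal)[0, 1] >= min_corr:
                detections.append((i, j))
    return detections

# Synthetic probe: 10 background frames, one complete action over
# frames 10..60, then 10 trailing frames stuck at 100%.
probe = np.concatenate([np.zeros(10),
                        np.linspace(0.0, 100.0, 51),
                        np.full(10, 100.0)])
dets = detect_complete_actions(probe)
```

Because any frame near 0% can start a candidate and any frame near 100% can end one, a real system would additionally apply non-maximum suppression to keep only the best-matching segment per action instance.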
Related papers
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Leveraging Action Affinity and Continuity for Semi-supervised Temporal
Action Segmentation [24.325716686674042]
We present a semi-supervised learning approach to the temporal action segmentation task.
The goal of the task is to temporally detect and segment actions in long, untrimmed procedural videos.
We propose two novel loss functions for the unlabelled data: an action affinity loss and an action continuity loss.
arXiv Detail & Related papers (2022-07-18T14:52:37Z) - ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding such as temporal action detection (TAD) often suffers from the pain of huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z) - Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z) - End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple yet effective end-to-end approach that utilizes the unlabeled data.
Video action detection requires both action class prediction and spatio-temporal consistency.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
arXiv Detail & Related papers (2022-03-08T18:11:25Z) - Technical Report: Disentangled Action Parsing Networks for Accurate
Part-level Action Parsing [65.87931036949458]
Part-level Action Parsing aims at part state parsing for boosting action recognition in videos.
We present a simple yet effective approach, named disentangled action parsing (DAP).
arXiv Detail & Related papers (2021-11-05T02:29:32Z) - End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z) - Unsupervised Action Segmentation with Self-supervised Feature Learning
and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z) - Learning Salient Boundary Feature for Anchor-free Temporal Action
Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z) - A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action
Localization [12.353250130848044]
We present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions.
Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset.
arXiv Detail & Related papers (2021-01-03T03:08:18Z) - Weakly Supervised Temporal Action Localization Using Deep Metric
Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
arXiv Detail & Related papers (2020-01-21T22:01:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences arising from its use.