Estimation of Reliable Proposal Quality for Temporal Action Detection
- URL: http://arxiv.org/abs/2204.11695v1
- Date: Mon, 25 Apr 2022 14:33:49 GMT
- Title: Estimation of Reliable Proposal Quality for Temporal Action Detection
- Authors: Junshan Hu, Chaoxu Guo, Liansheng Zhuang, Biao Wang, Tiezheng Ge,
Yuning Jiang, Houqiang Li
- Abstract summary: We propose a new method that considers the moment and region perspectives simultaneously to align the two tasks by acquiring reliable proposal quality.
For the moment perspective, a Boundary Evaluate Module (BEM) is designed that focuses on local appearance and motion evolvement to estimate boundary quality.
For the region perspective, we introduce a Region Evaluate Module (REM), which uses a new and efficient sampling method for proposal feature representation.
- Score: 71.5989469643732
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Temporal action detection (TAD) aims to locate and recognize the actions in
an untrimmed video. Anchor-free methods, which mainly formulate TAD as two tasks, classification and localization, handled by two separate branches, have made remarkable progress. This paper reveals a temporal misalignment between the two tasks that hinders further progress. To address this, we propose a new method that considers the moment and region perspectives simultaneously, aligning the two tasks by acquiring reliable proposal quality. For the moment perspective, a Boundary Evaluate Module (BEM) is designed that focuses on local appearance and motion evolvement to estimate boundary quality, adopting a multi-scale scheme to handle varied action durations. For the region perspective, we introduce a Region Evaluate Module (REM) that uses a new and efficient sampling method to build proposal feature representations containing more contextual information than point features, which are used to refine the category score and proposal boundaries. The proposed Boundary Evaluate Module and Region Evaluate Module
(BREM) are generic, and they can be easily integrated with other anchor-free
TAD methods to achieve superior performance. In our experiments, BREM is
combined with two different frameworks and improves the performance on THUMOS14
by 3.6\% and 1.0\%, respectively, reaching a new state of the art (63.6\% average $m$AP). Meanwhile, a competitive result of 36.2\% average $m$AP is achieved on ActivityNet-1.3, with consistent improvements from BREM.
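To make the alignment idea above concrete, the sketch below shows, in PyTorch, how a per-moment boundary-quality curve and a region-level quality score could be fused with classification confidence to re-rank proposals. It is a minimal illustration under assumed tensor shapes and a simple geometric-mean fusion rule; it is not the paper's BEM/REM implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the general idea (not the authors' implementation):
# fuse a per-moment boundary-quality estimate with a region-level quality
# score to re-rank anchor-free proposals. Module names, tensor shapes and
# the fusion rule below are illustrative assumptions.

class BoundaryQualityHead(nn.Module):
    """Predicts start/end boundary-quality curves over the temporal axis."""

    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 2, kernel_size=1),  # channel 0: start, channel 1: end
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T) -> quality curves: (B, 2, T), values in (0, 1)
        return self.net(feats).sigmoid()


def rank_proposals(cls_scores, boundary_quality, region_quality):
    """Combine classification confidence with boundary/region quality.

    cls_scores:       (N,) classification confidence per proposal
    boundary_quality: (N,) boundary quality sampled at each proposal's start/end
    region_quality:   (N,) IoU-style quality predicted from region features
    """
    # Geometric-mean fusion; the exact weighting is an assumption.
    quality = (boundary_quality * region_quality).clamp(min=1e-6).sqrt()
    return cls_scores * quality


if __name__ == "__main__":
    feats = torch.randn(2, 512, 100)          # (batch, channels, temporal length)
    curves = BoundaryQualityHead(512)(feats)  # (2, 2, 100)
    print(curves.shape)
```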
Related papers
- Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
- DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation [84.78383981697377]
Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue.
We develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention.
We achieve state-of-the-art accuracy, e.g., 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-04T20:27:18Z)
- Faster Learning of Temporal Action Proposal via Sparse Multilevel Boundary Generator [9.038216757761955]
Temporal action localization in videos presents significant challenges in the field of computer vision.
We propose a novel framework, Sparse Multilevel Boundary Generator (SMBG), which enhances the boundary-sensitive method with boundary classification and action completeness regression.
Our method is evaluated on two popular benchmarks, ActivityNet-1.3 and THUMOS14, and is shown to achieve state-of-the-art performance with better inference speed (2.47x faster than BSN++, 2.12x faster than DBG).
arXiv Detail & Related papers (2023-03-06T14:26:56Z)
- DCAN: Improving Temporal Action Detection via Dual Context Aggregation [29.46851768470807]
Temporal action detection aims to locate the boundaries of actions in videos.
Current boundary-matching-based methods enumerate and calculate all possible boundary matchings to generate proposals.
We propose Dual Context Aggregation Network (DCAN) to aggregate context on two levels, namely, boundary level and proposal level.
arXiv Detail & Related papers (2021-12-07T10:14:26Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization [92.96802448718388]
We introduce an adaptive mutual supervision framework (AMS) for temporal action localization.
The proposed AMS method significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-06T08:31:10Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
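Several entries above, like the main paper, build on a purely anchor-free formulation in which every temporal position predicts a class score and distances to the action boundaries. The sketch below is a generic, minimal illustration of that shared formulation; the module name, shapes, and layer choices are assumptions, and it does not reproduce any specific method listed here.

```python
import torch
import torch.nn as nn

# Generic anchor-free temporal detection head (illustrative only): every
# temporal position predicts class logits plus non-negative distances to
# the action start and end, mirroring FCOS-style detection in 1D.

class AnchorFreeTADHead(nn.Module):
    def __init__(self, in_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, num_classes, kernel_size=1),
        )
        self.reg_branch = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 2, kernel_size=1),  # distances to start and end
        )

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, T)
        cls_logits = self.cls_branch(feats)      # (B, num_classes, T)
        offsets = self.reg_branch(feats).relu()  # (B, 2, T), non-negative
        t = torch.arange(feats.shape[-1], device=feats.device, dtype=feats.dtype)
        starts = t - offsets[:, 0]               # per-moment segment start
        ends = t + offsets[:, 1]                 # per-moment segment end
        return cls_logits, torch.stack([starts, ends], dim=1)


if __name__ == "__main__":
    head = AnchorFreeTADHead(in_dim=512, num_classes=20)
    logits, segments = head(torch.randn(1, 512, 100))
    print(logits.shape, segments.shape)  # (1, 20, 100) (1, 2, 100)
```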
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.