DCAN: Improving Temporal Action Detection via Dual Context Aggregation
- URL: http://arxiv.org/abs/2112.03612v1
- Date: Tue, 7 Dec 2021 10:14:26 GMT
- Title: DCAN: Improving Temporal Action Detection via Dual Context Aggregation
- Authors: Guo Chen, Yin-Dong Zheng, Limin Wang, Tong Lu
- Abstract summary: Temporal action detection aims to locate the boundaries of actions in a video.
Current methods based on boundary matching enumerate and calculate all possible boundary matchings to generate proposals.
We propose the Dual Context Aggregation Network (DCAN) to aggregate context at two levels, namely the boundary level and the proposal level.
- Score: 29.46851768470807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action detection aims to locate the boundaries of actions in a
video. Current methods based on boundary matching enumerate and calculate all
possible boundary matchings to generate proposals. However, these methods
neglect long-range context aggregation in boundary prediction. At the same
time, because adjacent matchings share similar semantics, local semantic
aggregation of densely generated matchings cannot improve semantic richness or
discrimination. In this paper, we propose an end-to-end proposal generation
method named Dual Context Aggregation Network (DCAN) that aggregates context at
two levels, namely the boundary level and the proposal level, to generate
high-quality action proposals and thereby improve the performance of temporal
action detection. Specifically, we design Multi-Path Temporal Context
Aggregation (MTCA) to achieve smooth context aggregation at the boundary level
and precise evaluation of boundaries. For matching evaluation, Coarse-to-fine
Matching (CFM) is designed to aggregate context at the proposal level and
refine the matching map from coarse to fine. We conduct extensive experiments
on ActivityNet v1.3 and THUMOS-14. DCAN obtains an average mAP of 35.39% on
ActivityNet v1.3 and an mAP of 54.14% at IoU@0.5 on THUMOS-14, which
demonstrates that DCAN can generate high-quality proposals and achieve
state-of-the-art performance. We release the code at
https://github.com/cg1177/DCAN.
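To make the boundary-matching paradigm concrete, the sketch below shows how such methods densely enumerate (start, end) candidates into a matching map and score them, plus the temporal IoU used for evaluation at thresholds such as IoU@0.5. All names, shapes, and the product scoring rule are illustrative assumptions in the style of BMN-like boundary matching, not DCAN's actual implementation.

```python
# Illustrative sketch of dense boundary matching; assumptions, not DCAN's code.
import numpy as np

def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments, in snippet units."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def enumerate_matchings(start_prob, end_prob, max_duration):
    """Enumerate every (duration, start) candidate into a D x T matching map.

    start_prob, end_prob: (T,) per-snippet boundary probabilities.
    Entry [d, s] scores the proposal [s, s + d + 1] by the product of its
    boundary probabilities (a common, simple scoring choice).
    """
    T = len(start_prob)
    match_map = np.zeros((max_duration, T), dtype=np.float32)
    for s in range(T):
        for d in range(max_duration):
            e = s + d + 1
            if e < T:
                match_map[d, s] = start_prob[s] * end_prob[e]
    return match_map

# Toy usage: 100 snippets, proposals up to 32 snippets long.
rng = np.random.default_rng(0)
start_p, end_p = rng.random(100), rng.random(100)
mm = enumerate_matchings(start_p, end_p, max_duration=32)
d, s = np.unravel_index(mm.argmax(), mm.shape)
proposal = (int(s), int(s) + int(d) + 1)
print("top proposal:", proposal, "score:", round(float(mm[d, s]), 3))
print("tIoU vs ground truth [40, 70]:", round(temporal_iou(proposal, (40, 70)), 3))
```

The D x T map is exactly the densely generated structure whose adjacent entries share similar semantics, which is why the abstract argues that purely local aggregation over it adds little discrimination.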
Related papers
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve the imprecise action-boundary predictions of existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Temporal Action Localization with Multi-temporal Scales [54.69057924183867]
We propose to predict actions on a feature space of multi-temporal scales.
Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales.
The proposed method achieves improvements of 12.6%, 17.4%, and 2.2%, respectively.
arXiv Detail & Related papers (2022-08-16T01:48:23Z)
- Context-aware Proposal Network for Temporal Action Detection [47.72048484299649]
This report presents our first-place solution for the temporal action detection task in the CVPR 2022 ActivityNet Challenge.
The task aims to localize temporal boundaries of action instances with specific classes in long untrimmed videos.
We argue that the generated proposals contain rich contextual information, which may benefit detection confidence prediction.
arXiv Detail & Related papers (2022-06-18T01:43:43Z)
- Estimation of Reliable Proposal Quality for Temporal Action Detection [71.5989469643732]
We propose a new method that considers the moment and region perspectives simultaneously, aligning the two tasks by acquiring reliable proposal quality.
For the moment perspective, a Boundary Evaluate Module (BEM) is designed that focuses on local appearance and motion evolvement to estimate boundary quality.
For the region perspective, we introduce Region Evaluate Module (REM) which uses a new and efficient sampling method for proposal feature representation.
arXiv Detail & Related papers (2022-04-25T14:33:49Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation; a minimal sketch of this local-plus-global pattern appears after this list.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
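As referenced in the TCANet entry above, the sketch below illustrates the generic "local and global" temporal context aggregation pattern shared by these proposal-refinement methods: a 1D convolution contributes local context while self-attention contributes global context over the snippet sequence. The module name, dimensions, and residual fusion are assumptions for illustration, not the architecture of TCANet or DCAN.

```python
# Minimal local + global temporal context aggregation; an illustrative
# assumption, not TCANet's or DCAN's actual architecture.
import torch
import torch.nn as nn

class LocalGlobalAggregator(nn.Module):
    def __init__(self, dim=256, kernel_size=3, num_heads=4):
        super().__init__()
        # Local branch: short-range context via 1D convolution over time.
        self.local = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        # Global branch: every snippet attends to every other snippet.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, T, dim) snippet features
        local = self.local(x.transpose(1, 2)).transpose(1, 2)
        global_ctx, _ = self.attn(x, x, x)
        return self.norm(x + local + global_ctx)  # residual fusion

feats = torch.randn(2, 100, 256)  # 2 videos, 100 snippets each
print(LocalGlobalAggregator()(feats).shape)  # torch.Size([2, 100, 256])
```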