Exploring Stronger Feature for Temporal Action Localization
- URL: http://arxiv.org/abs/2106.13014v1
- Date: Thu, 24 Jun 2021 13:46:30 GMT
- Title: Exploring Stronger Feature for Temporal Action Localization
- Authors: Zhiwu Qing and Xiang Wang and Ziyuan Huang and Yutong Feng and Shiwei Zhang and Jianwen Jiang and Mingqian Tang and Changxin Gao and Nong Sang
- Abstract summary: Temporal action localization aims to localize the start and end time of each action instance along with its action category.
We explored classic convolution-based backbones and recently popular transformer-based backbones.
With a single SlowFast feature and a simple combination, BMN+TCANet, we achieve 42.42% mAP on the validation set.
- Score: 41.23726979184197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization aims to localize the start and end time of
each action instance along with its action category. Limited by GPU memory, mainstream
methods pre-extract features for each video, so feature quality determines the upper
bound of detection performance. In this technical report, we explored classic
convolution-based backbones and recently popular transformer-based backbones. We found
that transformer-based methods can achieve better classification performance than
convolution-based ones, but they cannot generate accurate action proposals. In
addition, extracting features at a larger frame resolution to reduce the loss of
spatial information can also effectively improve the performance of temporal action
localization. Finally, with a single SlowFast feature and a simple combination,
BMN+TCANet, we achieve 42.42% mAP on the validation set, which is 1.87% higher than
the 2020 multi-model ensemble result. With this approach, we rank 1st in the CVPR 2021
HACS supervised Temporal Action Localization Challenge.
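As a rough illustration of the feature pre-extraction step the abstract refers to, here is a minimal sketch, assuming a generic torchvision video backbone as a stand-in for SlowFast; the clip length, stride, and resolution below are illustrative choices, not the report's actual settings.

```python
# Minimal sketch (not the authors' code) of clip-level feature pre-extraction:
# slide a fixed-length window over a decoded video and keep one feature vector
# per snippet. r3d_18 is only a placeholder for the SlowFast backbone.
import torch
import torchvision

backbone = torchvision.models.video.r3d_18(weights=None)  # placeholder backbone (torchvision >= 0.13 API)
backbone.fc = torch.nn.Identity()                          # keep pooled features, drop the classifier
backbone.eval()

@torch.no_grad()
def extract_clip_features(frames, clip_len=32, stride=16):
    """frames: (T, C, H, W) float tensor, already decoded and normalized."""
    feats = []
    for start in range(0, max(frames.shape[0] - clip_len, 0) + 1, stride):
        clip = frames[start:start + clip_len]            # (clip_len, C, H, W)
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)     # (1, C, clip_len, H, W)
        feats.append(backbone(clip))                     # (1, feature_dim)
    # Detectors such as BMN or TCANet consume this (num_snippets, feature_dim)
    # sequence instead of raw frames, which is why feature quality caps the
    # achievable localization performance.
    return torch.cat(feats, dim=0)

# Example: a 128-frame video at 112x112 yields 7 snippet features with these settings.
features = extract_clip_features(torch.randn(128, 3, 112, 112))
```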
Related papers
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- Temporal Action Localization with Multi-temporal Scales [54.69057924183867]
We propose to predict actions on a feature space of multi-temporal scales.
Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales.
The proposed method achieves improvements of 12.6%, 17.4%, and 2.2% in its three evaluation settings.
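A minimal sketch of the multi-temporal-scale idea described above, assuming generic 1D snippet features: a top-down temporal feature pyramid that passes semantics from coarse (high-level) scales down to finer (low-level) scales. This is not the paper's implementation; module names and sizes are hypothetical.

```python
# Minimal sketch of a temporal feature pyramid with a top-down path.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFPN(nn.Module):
    def __init__(self, channels=256, num_levels=3):
        super().__init__()
        # Strided 1D convolutions build progressively coarser temporal scales.
        self.down = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels - 1)
        ])
        self.smooth = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_levels)
        ])

    def forward(self, x):                      # x: (B, C, T) snippet features
        pyramid = [x]
        for down in self.down:                 # bottom-up: T, T/2, T/4, ...
            pyramid.append(down(pyramid[-1]))
        # Top-down: upsample each coarser level and add it to the next finer one.
        for i in range(len(pyramid) - 2, -1, -1):
            up = F.interpolate(pyramid[i + 1], size=pyramid[i].shape[-1], mode="linear")
            pyramid[i] = pyramid[i] + up
        return [smooth(p) for smooth, p in zip(self.smooth, pyramid)]
```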
arXiv Detail & Related papers (2022-08-16T01:48:23Z)
- An Efficient Spatio-Temporal Pyramid Transformer for Action Detection [40.68615998427292]
We present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) video framework for action detection.
Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention to capture long-range spatio-temporal dependencies in the later stages.
In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency.
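Below is a minimal sketch of the early-local / late-global attention scheme described above, not the STPT implementation; the window size, dimensions, and class name are hypothetical.

```python
# Minimal sketch: local window attention vs. global attention over temporal tokens.
import torch
import torch.nn as nn

class WindowOrGlobalAttention(nn.Module):
    def __init__(self, dim=256, heads=8, window=8, use_window=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window
        self.use_window = use_window

    def forward(self, x):                    # x: (B, T, C) tokens
        if not self.use_window:
            return self.attn(x, x, x)[0]     # global: every token attends to all tokens
        B, T, C = x.shape
        pad = (-T) % self.window
        x = nn.functional.pad(x, (0, 0, 0, pad))                # pad T to a multiple of window
        w = x.reshape(B * (x.shape[1] // self.window), self.window, C)
        w = self.attn(w, w, w)[0]            # local: attention restricted to each window
        return w.reshape(B, -1, C)[:, :T]

# Early stages would use use_window=True (cheap, local detail);
# later stages would use use_window=False (long-range dependencies).
```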
arXiv Detail & Related papers (2022-07-21T12:38:05Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
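For reference, a minimal sketch of the temporal IoU (tIoU) measure behind the "high tIoU thresholds" mentioned above; segment boundaries are assumed to be given in seconds.

```python
# Minimal sketch: temporal IoU between a predicted segment and a ground-truth segment.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a proposal that misses the true boundaries by one second on each side.
print(temporal_iou((10.0, 20.0), (11.0, 21.0)))  # ~0.818, which fails a 0.9 tIoU threshold
```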
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
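A minimal sketch of the idea of an adaptive local graph built from differences between adjacent snippet features, followed by one graph-convolution step; this is not the ATAG code, and the gating form and names are hypothetical.

```python
# Minimal sketch: adjacent-snippet edge weights from feature differences, then one GCN step.
import torch
import torch.nn as nn

class LocalAdaptiveGCN(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.edge = nn.Linear(dim, 1)     # scores an edge from a feature difference
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, T, C) snippet features
        B, T, C = x.shape
        diff = x[:, 1:] - x[:, :-1]                       # differences of adjacent snippets
        w = torch.sigmoid(self.edge(diff)).squeeze(-1)    # (B, T-1) adaptive edge weights
        adj = torch.zeros(B, T, T, device=x.device)
        idx = torch.arange(T - 1, device=x.device)
        adj[:, idx, idx + 1] = w                          # connect t -> t+1
        adj[:, idx + 1, idx] = w                          # and t+1 -> t
        adj = adj + torch.eye(T, device=x.device)         # self-loops
        adj = adj / adj.sum(dim=-1, keepdim=True)         # row-normalize
        return torch.relu(self.proj(adj @ x))             # aggregate neighbors, then project
```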
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and unreliable confidence scores used for proposal retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
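A minimal sketch of "local and global" temporal context aggregation, assuming a 1D-convolution branch for local context and a self-attention branch for global context; this is not the actual TCANet modules, and all names and sizes are illustrative.

```python
# Minimal sketch: fuse short-range (conv) and long-range (attention) temporal context.
import torch
import torch.nn as nn

class LocalGlobalContext(nn.Module):
    def __init__(self, dim=256, heads=8, kernel=3):
        super().__init__()
        self.local = nn.Conv1d(dim, dim, kernel_size=kernel, padding=kernel // 2)
        self.globl = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                                         # x: (B, T, C) snippet features
        local = self.local(x.transpose(1, 2)).transpose(1, 2)     # short-range context
        globl = self.globl(x, x, x)[0]                             # long-range context
        return self.fuse(torch.cat([local, globl], dim=-1))        # fused representation
```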
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
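A minimal sketch of an anchor-free localization head in the spirit described above: each temporal position directly regresses its distances to the action start and end instead of matching predefined anchors. This is not the paper's model; shapes and the class count are hypothetical.

```python
# Minimal sketch: per-position start/end distance regression plus classification.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, dim=256, num_classes=200):
        super().__init__()
        self.reg = nn.Conv1d(dim, 2, kernel_size=3, padding=1)           # (dist_to_start, dist_to_end)
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)

    def forward(self, x):                        # x: (B, C, T) snippet features
        dist = torch.relu(self.reg(x))           # non-negative offsets, (B, 2, T)
        t = torch.arange(x.shape[-1], device=x.device).float()
        start = t - dist[:, 0]                   # predicted segment per temporal position
        end = t + dist[:, 1]
        return torch.stack([start, end], dim=1), self.cls(x)
```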
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
- CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization [42.95186231216036]
We propose Coarse-to-Fine Action Detector (CFAD) for efficient action localization.
CFAD first estimates coarse spatio-temporal action tubes from video streams, and then refines tube locations based on key timestamps.
arXiv Detail & Related papers (2020-08-19T08:47:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.