Temporal Action Localization with Multi-temporal Scales
- URL: http://arxiv.org/abs/2208.07493v1
- Date: Tue, 16 Aug 2022 01:48:23 GMT
- Title: Temporal Action Localization with Multi-temporal Scales
- Authors: Zan Gao, Xinglei Cui, Tao Zhuo, Zhiyong Cheng, An-An Liu, Meng Wang,
and Shengyong Chen
- Abstract summary: We propose to predict actions on a feature space of multi-temporal scales.
Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales.
On THUMOS14, the proposed method achieves improvements of 12.6%, 17.4%, and 2.2% over A2Net, Sub-Action, and AFSD, respectively.
- Score: 54.69057924183867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization plays an important role in video analysis, which
aims to localize and classify actions in untrimmed videos. The previous methods
often predict actions on a feature space of a single-temporal scale. However,
the temporal features of a low-level scale lack enough semantics for action
classification while a high-level scale cannot provide rich details of the
action boundaries. To address this issue, we propose to predict actions on a
feature space of multi-temporal scales. Specifically, we use refined feature
pyramids of different scales to pass semantics from high-level scales to
low-level scales. Besides, to establish the long temporal scale of the entire
video, we use a spatial-temporal transformer encoder to capture the long-range
dependencies of video frames. Then the refined features with long-range
dependencies are fed into a classifier for the coarse action prediction.
Finally, to further improve the prediction accuracy, we propose to use a
frame-level self-attention module to refine the classification and boundaries
of each action instance. Extensive experiments show that the proposed method
can outperform state-of-the-art approaches on the THUMOS14 dataset and achieves
comparable performance on the ActivityNet1.3 dataset. Compared with A2Net
(TIP20, Avg\{0.3:0.7\}), Sub-Action (CSVT2022, Avg\{0.1:0.5\}), and AFSD
(CVPR21, Avg\{0.3:0.7\}) on the THUMOS14 dataset, the proposed method can
achieve improvements of 12.6\%, 17.4\% and 2.2\%, respectively
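A minimal PyTorch sketch of the pipeline described above: a transformer encoder for long-range dependencies, a refined top-down feature pyramid that passes semantics from high-level to low-level scales, coarse per-scale classification, and a frame-level self-attention refinement head. All names and dimensions here (`TopDownPyramid`, `Localizer`, dim=256, 4 levels) are assumptions for illustration, and the stock `nn.TransformerEncoder` stands in for the paper's spatial-temporal encoder; this is not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline; shapes and names assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    """Refined feature pyramid: build coarser scales bottom-up, then
    pass semantics from high-level scales back down to low-level ones."""
    def __init__(self, dim, num_levels=4):
        super().__init__()
        self.downs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels - 1))
        self.smooth = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1)
            for _ in range(num_levels))

    def forward(self, x):                          # x: (B, dim, T)
        feats = [x]
        for down in self.downs:                    # bottom-up, halving T
            feats.append(down(feats[-1]))
        out = [feats[-1]]
        for f in reversed(feats[:-1]):             # top-down semantic fusion
            up = F.interpolate(out[0], size=f.shape[-1], mode="linear",
                               align_corners=False)
            out.insert(0, f + up)
        return [s(f) for s, f in zip(self.smooth, out)]

class Localizer(nn.Module):
    """Transformer encoder for long-range dependencies, coarse per-scale
    classification, then frame-level self-attention refinement."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pyramid = TopDownPyramid(dim)
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=1)
        self.refine = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.boundary = nn.Conv1d(dim, 2, kernel_size=1)

    def forward(self, x):                          # x: (B, T, dim) frame features
        x = self.encoder(x)                        # long-range dependencies
        levels = self.pyramid(x.transpose(1, 2))
        coarse = [self.cls(f) for f in levels]     # coarse prediction per scale
        finest = levels[0].transpose(1, 2)         # (B, T, dim)
        refined, _ = self.refine(finest, finest, finest)  # frame-level attention
        refined = refined.transpose(1, 2)
        return coarse, self.cls(refined), self.boundary(refined)

coarse, cls_scores, bounds = Localizer()(torch.randn(2, 128, 256))
```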
Related papers
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve the imprecise boundary predictions of existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
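The summary does not spell out how TriDet sharpens boundaries; one common remedy in this spirit (an assumption here, not the paper's exact head) is to predict a distribution over discrete boundary offsets and decode the boundary as its expectation:

```python
# Illustrative sketch only; K candidate offsets and shapes are assumed.
import torch

def expected_boundary(offset_logits: torch.Tensor) -> torch.Tensor:
    """Decode a boundary from per-frame logits over K candidate offsets
    (B, T, K): softmax, then take the expectation in frame units."""
    probs = offset_logits.softmax(dim=-1)
    offsets = torch.arange(offset_logits.shape[-1],
                           dtype=probs.dtype)          # 0..K-1 frames
    return (probs * offsets).sum(dim=-1)               # (B, T) soft offsets

start_offsets = expected_boundary(torch.randn(2, 128, 16))
```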
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Boundary-Denoising for Video Activity Localization [57.9973253014712]
We study the video activity localization problem from a denoising perspective.
Specifically, we propose an encoder-decoder model named DenoiseLoc.
Experiments show that DenoiseLoc advances the state of the art in several video activity understanding tasks.
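The denoising perspective can be illustrated under the assumption that training perturbs ground-truth boundaries and the model learns to recover them; this is a standard denoising setup, not necessarily DenoiseLoc's exact recipe:

```python
# Hypothetical training-data sketch; the noise scale is an assumption.
import torch

def jitter_segments(segments: torch.Tensor, noise_scale: float = 0.05):
    """Perturb ground-truth (start, end) pairs in normalized [0, 1] time;
    the decoder is then trained to map noisy spans back to the truth."""
    noisy = segments + noise_scale * torch.randn_like(segments)
    noisy = noisy.clamp(0.0, 1.0)
    return noisy.sort(dim=-1).values                   # keep start <= end

gt = torch.tensor([[0.20, 0.45], [0.60, 0.90]])
noisy_input = jitter_segments(gt)                      # fed to the denoiser
```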
arXiv Detail & Related papers (2023-04-06T08:48:01Z)
- Post-Processing Temporal Action Detection [134.26292288193298]
Temporal Action Detection (TAD) methods typically apply a pre-processing step that converts an input video of varying length into a fixed-length sequence of snippet representations.
This pre-processing temporally downsamples the video, reducing the inference resolution and hampering detection performance at the original temporal resolution.
We introduce a novel model-agnostic post-processing method without model redesign and retraining.
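As a hedged sketch of the resolution-recovery idea (the paper's actual method is more elaborate; `recover_resolution` and linear interpolation are assumptions), snippet-level predictions can be resampled to the original frame count without retraining the model:

```python
# Illustration only; the real post-processing is more sophisticated.
import torch
import torch.nn.functional as F

def recover_resolution(snippet_scores: torch.Tensor, num_frames: int):
    """Resample (B, C, T_snippets) class scores to the original frame
    count, undoing the fixed-length snippet pre-processing."""
    return F.interpolate(snippet_scores, size=num_frames,
                         mode="linear", align_corners=False)

frame_scores = recover_resolution(torch.rand(1, 20, 100), num_frames=768)
```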
arXiv Detail & Related papers (2022-11-27T19:50:37Z)
- Adaptive Perception Transformer for Temporal Action Localization [13.735402329482719]
This paper proposes a novel end-to-end model called the adaptive perception transformer (AdaPerFormer), which has two branches.
One branch takes care of the global perception attention, which can model entire video sequences and aggregate global relevant contexts.
The other branch concentrates on the local convolutional shift to aggregate intra-frame and inter-frame information.
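A minimal two-branch sketch of that design, with every name and size assumed: a global self-attention branch over the full sequence, plus a local branch built from a temporal shift and a convolution, fused residually:

```python
# Hypothetical dual-branch block; fusion by summation is an assumption.
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Global self-attention branch plus a local convolutional-shift
    branch; the two are fused with a residual sum."""
    def __init__(self, dim=256, heads=8, shift=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.shift = shift

    def forward(self, x):                              # x: (B, T, dim)
        g, _ = self.attn(x, x, x)                      # global context
        shifted = torch.roll(x, self.shift, dims=1)    # inter-frame shift
        l = self.local(shifted.transpose(1, 2)).transpose(1, 2)  # local mixing
        return x + g + l

out = DualBranchBlock()(torch.randn(2, 128, 256))
```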
arXiv Detail & Related papers (2022-08-25T07:42:48Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
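The sampling-and-grouping cross-attention can be sketched roughly as follows; random sampling stands in for the paper's sampling scheme and grouping is left implicit, so treat this only as an illustration of sampled queries attending to the full point set:

```python
# Illustrative sketch; sampling strategy and sizes are assumptions.
import torch
import torch.nn as nn

class SampledCrossAttention(nn.Module):
    """A sampled subset of points (queries) cross-attends to the full
    point set (keys/values), keeping attention cost sub-quadratic."""
    def __init__(self, dim=64, heads=4, num_samples=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.num_samples = num_samples

    def forward(self, feats):                          # feats: (B, N, dim)
        idx = torch.randperm(feats.shape[1])[:self.num_samples]
        queries = feats[:, idx]                        # sampled centers
        out, _ = self.attn(queries, feats, feats)      # dynamic global attention
        return out                                     # (B, num_samples, dim)

centers = SampledCrossAttention()(torch.randn(2, 4096, 64))
```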
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
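One plausible, DETR-style reading of sparse proposals interacting with hierarchical features (assumed here, not taken from the paper) is a small set of learned queries cross-attending to each pyramid level in turn:

```python
# Hypothetical DETR-style head; proposal count and dims are assumptions.
import torch
import torch.nn as nn

class SparseProposalHead(nn.Module):
    """A few learned proposal queries attend to each feature level,
    then regress normalized (start, end) spans."""
    def __init__(self, dim=256, num_proposals=32, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_proposals, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.span = nn.Linear(dim, 2)          # normalized (start, end)

    def forward(self, levels):                 # list of (B, T_l, dim)
        q = self.queries.expand(levels[0].shape[0], -1, -1)
        for feats in levels:                   # coarse-to-fine interaction
            q, _ = self.attn(q, feats, feats)
        return self.span(q).sigmoid()          # (B, num_proposals, 2)

levels = [torch.randn(2, t, 256) for t in (128, 64, 32)]
spans = SparseProposalHead()(levels)
```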
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- Exploring Stronger Feature for Temporal Action Localization [41.23726979184197]
Temporal action localization aims to localize the start and end times of actions together with their categories.
We explored classic convolution-based backbones and the recent surge of transformer-based backbones.
We achieve 42.42% in terms of mAP on validation set with a single SlowFast feature by a simple combination.
arXiv Detail & Related papers (2021-06-24T13:46:30Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
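Anchor-free localization of this kind typically predicts, at every frame, class scores plus distances to the segment's start and end; the head below is a generic anchor-free sketch under assumed shapes, not the paper's model:

```python
# Generic anchor-free head for illustration; shapes are assumptions.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Per-frame action scores plus distances to the segment's
    start and end, with no predefined anchors."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(dim, 2, kernel_size=3, padding=1)

    def forward(self, feats):                        # feats: (B, dim, T)
        scores = self.cls(feats)                     # (B, num_classes, T)
        dists = self.reg(feats).exp()                # positive (d_start, d_end)
        t = torch.arange(feats.shape[-1], dtype=feats.dtype)
        starts = t - dists[:, 0]                     # decoded segment starts
        ends = t + dists[:, 1]                       # decoded segment ends
        return scores, torch.stack((starts, ends), dim=-1)

scores, segments = AnchorFreeHead()(torch.randn(2, 256, 128))
```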
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
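The joint objective can be sketched as balanced binary cross-entropy on video-level labels plus a metric term; the class-balancing weights and the triplet form are assumptions, since the summary only names the two loss types:

```python
# Assumed instantiation of the two named loss terms; not the paper's code.
import torch
import torch.nn.functional as F

def weak_tal_loss(video_logits, labels, anchor, positive, negative,
                  margin=0.5, w=1.0):
    """Balanced BCE on video-level multi-label predictions plus a
    triplet metric loss on embeddings."""
    pos_frac = labels.float().mean().clamp(1e-4, 1 - 1e-4)
    weight = torch.where(labels.bool(), 1 - pos_frac, pos_frac)  # balance
    bce = F.binary_cross_entropy_with_logits(
        video_logits, labels.float(), weight=weight)
    metric = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return bce + w * metric

loss = weak_tal_loss(torch.randn(4, 20), torch.randint(0, 2, (4, 20)),
                     torch.randn(4, 128), torch.randn(4, 128),
                     torch.randn(4, 128))
```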
arXiv Detail & Related papers (2020-01-21T22:01:17Z)