Towards High-Quality Temporal Action Detection with Sparse Proposals
- URL: http://arxiv.org/abs/2109.08847v1
- Date: Sat, 18 Sep 2021 06:15:19 GMT
- Title: Towards High-Quality Temporal Action Detection with Sparse Proposals
- Authors: Jiannan Wu, Peize Sun, Shoufa Chen, Jiewen Yang, Zihao Qi, Lan Ma,
Ping Luo
- Abstract summary: Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
- Score: 14.923321325749196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Detection (TAD) is an essential and challenging topic in
video understanding, aiming to localize the temporal segments containing human
action instances and predict the action categories. Previous works rely
heavily on dense candidates, either by designing varied anchors or by
enumerating all combinations of boundaries on video sequences; they therefore
involve complicated pipelines and sensitive hand-crafted designs. Recently,
with the resurgence of the Transformer, query-based methods have emerged as a
rising solution for their simplicity and flexibility. However, there still
exists a performance gap between query-based methods and well-established
methods. In this paper, we identify that the main challenge lies in the large
variance of action durations and the ambiguous boundaries of short action
instances; meanwhile, the quadratic cost of global attention prevents
query-based methods from building multi-scale feature maps. Towards
high-quality
temporal action detection, we introduce Sparse Proposals to interact with the
hierarchical features. In our method, named SP-TAD, each proposal attends to a
local segment feature in the temporal feature pyramid. This local interaction
enables the use of high-resolution features to preserve the details of action
instances. Extensive experiments demonstrate the effectiveness of our method,
especially under high tIoU thresholds. For example, we achieve state-of-the-art
performance on THUMOS14 (45.7% on mAP@0.6, 33.4% on mAP@0.7 and 53.5% on
mAP@Avg) and competitive results on ActivityNet-1.3 (32.99% on mAP@Avg). Code
will be made available at https://github.com/wjn922/SP-TAD.
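To make the abstract's sparse-proposal idea concrete, below is a minimal sketch in which each proposal attends only to its own local segment of a 1D temporal feature pyramid. The level-assignment heuristic, the single-head attention, and all tensor shapes are assumptions for illustration, not the released SP-TAD implementation (see the repository above for the actual code).

```python
# Minimal sketch of sparse-proposal / local-segment interaction over a
# 1D temporal feature pyramid. Shapes and the level-assignment rule are
# our assumptions, not the released SP-TAD implementation.
import torch
import torch.nn.functional as F

def local_segment_attention(pyramid, queries, segments):
    """pyramid: list of (T_l, C) feature maps, coarser as l grows.
    queries:  (N, C) one learnable embedding per sparse proposal.
    segments: (N, 2) normalized (start, end) in [0, 1] per proposal.
    Returns:  (N, C) proposal features refined by local attention."""
    outputs = []
    num_levels = len(pyramid)
    for q, (s, e) in zip(queries, segments):
        # Assumed heuristic: longer proposals read from coarser levels.
        level = min(int((e - s) * num_levels), num_levels - 1)
        feats = pyramid[level]                       # (T_l, C)
        t = feats.shape[0]
        lo, hi = int(s * t), max(int(e * t), int(s * t) + 1)
        local = feats[lo:hi]                         # (L, C) local segment only
        # Single-head attention: the proposal query attends to its segment.
        attn = F.softmax(local @ q / q.shape[0] ** 0.5, dim=0)  # (L,)
        outputs.append(attn @ local)                 # (C,)
    return torch.stack(outputs)

# Toy usage: 3 pyramid levels, 4 sparse proposals, feature dim 8.
pyramid = [torch.randn(t, 8) for t in (64, 32, 16)]
queries = torch.randn(4, 8)
segments = torch.tensor([[0.0, 0.1], [0.2, 0.5], [0.1, 0.9], [0.6, 0.7]])
print(local_segment_attention(pyramid, queries, segments).shape)  # torch.Size([4, 8])
```

Because each proposal only touches the features inside its own segment, high-resolution pyramid levels stay affordable, which is the stated motivation for avoiding global attention.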
Related papers
- FMI-TAL: Few-shot Multiple Instances Temporal Action Localization by Probability Distribution Learning and Interval Cluster Refinement [2.261014973523156]
We propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement.
This method can accurately identify the start and end boundaries of actions in the query video.
Experiments on the ActivityNet-1.3 and THUMOS14 benchmarks show that the model achieves competitive performance.
arXiv Detail & Related papers (2024-08-25T08:17:25Z)
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves performance competitive with or superior to state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve the imprecise boundary predictions of existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Adaptive Perception Transformer for Temporal Action Localization [13.735402329482719]
This paper proposes a novel end-to-end model, called the adaptive perception transformer (AdaPerFormer).
One branch handles global perception attention, which models entire video sequences and aggregates globally relevant contexts.
The other branch concentrates on local convolutional shift to aggregate intra-frame and inter-frame information.
arXiv Detail & Related papers (2022-08-25T07:42:48Z)
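As a rough illustration of the dual-branch design described in the AdaPerFormer summary above, the sketch below pairs a global self-attention branch with a local temporal-shift branch. The shift fraction, pointwise mixing, and residual fusion are our assumptions, not details taken from the paper.

```python
# Hedged sketch of a dual-branch block: global self-attention over the
# whole sequence plus a local temporal-shift branch. Fusion by addition
# is assumed for illustration.
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, dim, heads=4, shift_frac=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mix = nn.Linear(dim, dim)   # pointwise mixing after the shift
        self.norm = nn.LayerNorm(dim)
        self.shift_frac = shift_frac

    def forward(self, x):                # x: (B, T, dim)
        # Global branch: every snippet attends to the entire sequence.
        g, _ = self.attn(x, x, x)
        # Local branch: shift a slice of channels one step forward and one
        # backward so each frame sees its neighbors, then mix channels.
        c = x.shape[-1] // self.shift_frac
        local = x.clone()
        local[:, 1:, :c] = x[:, :-1, :c]             # shift forward in time
        local[:, :-1, c:2 * c] = x[:, 1:, c:2 * c]   # shift backward in time
        l = self.mix(local)
        return self.norm(x + g + l)      # residual fusion (assumed)

x = torch.randn(2, 100, 64)              # batch of 2 clips, 100 snippets
print(DualBranchBlock(64)(x).shape)      # torch.Size([2, 100, 64])
```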
- Temporal Action Detection with Global Segmentation Mask Learning [134.26292288193298]
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video.
We propose a proposal-free Temporal Action detection model with Global mask (TAGS).
Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length.
arXiv Detail & Related papers (2022-07-14T00:46:51Z)
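The global-mask formulation summarized above lends itself to a simple decoding step: threshold each instance's full-length mask and keep a contiguous run as the segment. The sketch below uses the longest above-threshold run, a common heuristic that is not necessarily the paper's exact rule.

```python
# Decode full-length per-instance masks into (start, end) segments by
# keeping the longest above-threshold run (assumed heuristic).
import torch

def masks_to_segments(masks, thr=0.5):
    """masks: (N, T) per-instance foreground probabilities over the whole
    video. Returns one (start, end) frame pair per instance."""
    segments = []
    for m in masks:
        fg = (m > thr).int()
        best, run_start, best_len = None, None, 0
        for t, v in enumerate(fg.tolist() + [0]):   # sentinel ends last run
            if v and run_start is None:
                run_start = t
            elif not v and run_start is not None:
                if t - run_start > best_len:
                    best, best_len = (run_start, t), t - run_start
                run_start = None
        segments.append(best)
    return segments

masks = torch.tensor([[0.1, 0.9, 0.8, 0.2, 0.7], [0.6, 0.2, 0.1, 0.9, 0.9]])
print(masks_to_segments(masks))  # [(1, 3), (3, 5)]
```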
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, yielding the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
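To illustrate the adaptive-graph idea in the ATAG summary above, the sketch below derives edge weights between adjacent snippets from their feature difference and applies a one-hop graph convolution. The edge function and row normalization are our assumptions.

```python
# Hedged sketch of a local adaptive graph: adjacent snippets are linked
# with weights that fall off as their features differ, then aggregated
# with a one-hop graph convolution.
import torch
import torch.nn as nn

class AdjacentGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, T, dim)
        # Edge weight falls off with the difference of adjacent features.
        diff = (x[:, 1:] - x[:, :-1]).norm(dim=-1)      # (B, T-1)
        w = torch.sigmoid(-diff)                        # similar -> strong edge
        b, t, _ = x.shape
        adj = torch.zeros(b, t, t, device=x.device)
        idx = torch.arange(t - 1, device=x.device)
        adj[:, idx, idx + 1] = w                        # forward edges
        adj[:, idx + 1, idx] = w                        # backward edges
        adj = adj + torch.eye(t, device=x.device)       # self loops
        adj = adj / adj.sum(-1, keepdim=True)           # row-normalize
        return self.proj(adj @ x)                       # one-hop aggregation

x = torch.randn(2, 50, 32)
print(AdjacentGraphConv(32)(x).shape)                   # torch.Size([2, 50, 32])
```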
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
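As a minimal illustration of the anchor-free formulation summarized above, the sketch below has every temporal position directly regress its distances to the action's start and end, with no anchor set. The head layout is our assumption, not the paper's exact design.

```python
# Anchor-free temporal localization head: per-position class scores plus
# regressed distances to (start, end), decoded against position indices.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, 1)   # per-position class scores
        self.reg = nn.Conv1d(dim, 2, 1)             # distances to (start, end)

    def forward(self, feats):                       # feats: (B, dim, T)
        scores = self.cls(feats)                    # (B, num_classes, T)
        dist = torch.relu(self.reg(feats))          # (B, 2, T), nonnegative
        t = torch.arange(feats.shape[-1], device=feats.device).float()
        starts = t - dist[:, 0]                     # (B, T) decoded boundaries
        ends = t + dist[:, 1]
        return scores, torch.stack([starts, ends], dim=1)  # (B, 2, T)

head = AnchorFreeHead(dim=64, num_classes=20)
scores, segs = head(torch.randn(2, 64, 128))
print(scores.shape, segs.shape)  # torch.Size([2, 20, 128]) torch.Size([2, 2, 128])
```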
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)