PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points
- URL: http://arxiv.org/abs/2210.11035v3
- Date: Tue, 21 Mar 2023 16:03:50 GMT
- Title: PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points
- Authors: Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, Limin Wang
- Abstract summary: Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label.
In this paper, we focus on the task of multi-label temporal action detection that aims to localize all action instances from a multi-label untrimmed video.
We extend the sparse query-based detection paradigm from traditional TAD and propose the multi-label TAD framework of PointTAD.
- Score: 28.607690605262878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional temporal action detection (TAD) usually handles untrimmed videos
with a small number of action instances from a single label (e.g., ActivityNet,
THUMOS). However, this setting might be unrealistic as different classes of
actions often co-occur in practice. In this paper, we focus on the task of
multi-label temporal action detection that aims to localize all action
instances from a multi-label untrimmed video. Multi-label TAD is more
challenging as it requires fine-grained class discrimination within a
single video and precise localization of the co-occurring instances. To
mitigate this issue, we extend the sparse query-based detection paradigm from
traditional TAD and propose the multi-label TAD framework of PointTAD.
Specifically, our PointTAD introduces a small set of learnable query points to
represent the important frames of each action instance. This point-based
representation provides a flexible mechanism to localize both the discriminative
frames at action boundaries and the important frames inside the action.
Moreover, we perform the action decoding process with the Multi-level
Interactive Module to capture both point-level and instance-level action
semantics. Finally, our PointTAD employs an end-to-end trainable framework
simply based on RGB input for easy deployment. We evaluate our proposed method
on two popular benchmarks and introduce the new metric of detection-mAP for
multi-label TAD. Our model outperforms all previous methods by a large margin
under the detection-mAP metric, and also achieves promising results under the
segmentation-mAP metric. Code is available at
https://github.com/MCG-NJU/PointTAD.
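The query-point idea in the abstract can be illustrated with a minimal sketch. This is a hedged toy illustration in plain NumPy, not the actual PointTAD implementation (which lives in the linked repository): all names, shapes, and the nearest-frame sampling and min/max segment decoding here are assumptions chosen for clarity. Each action query holds a small set of normalized temporal points; the points sample frame-level features for decoding, and a segment can be read off the extreme points.

```python
import numpy as np

def sample_point_features(frame_feats, query_points):
    """Sample per-frame features at normalized temporal points in [0, 1].

    frame_feats: (T, C) array of frame-level features.
    query_points: (N,) array of learnable points for one query.
    Uses nearest-frame sampling for simplicity (an assumption of this sketch).
    """
    T = frame_feats.shape[0]
    idx = np.clip(np.round(query_points * (T - 1)).astype(int), 0, T - 1)
    return frame_feats[idx]  # (N, C)

def decode_query(frame_feats, query_points):
    """Aggregate sampled point features and read a segment off the points."""
    feats = sample_point_features(frame_feats, query_points)
    instance_feat = feats.mean(axis=0)                   # instance-level descriptor
    segment = (query_points.min(), query_points.max())   # predicted boundaries
    return instance_feat, segment
```

In this toy decoding, interior points attend to important frames inside the action while the extreme points define the boundaries, mirroring the flexibility the point-based representation is meant to provide.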
Related papers
- Dual DETRs for Multi-Label Temporal Action Detection [46.05173000284639]
Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos.
We propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level.
We evaluate DualDETR on three challenging multi-label TAD benchmarks.
arXiv Detail & Related papers (2024-03-31T11:43:39Z)
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- Temporal Action Detection with Global Segmentation Mask Learning [134.26292288193298]
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video.
We propose a proposal-free Temporal Action detection model with Global mask (TAGS).
Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length.
arXiv Detail & Related papers (2022-07-14T00:46:51Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimal run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- SegTAD: Precise Temporal Action Detection via Semantic Segmentation [65.01826091117746]
We formulate the task of temporal action detection in a novel perspective of semantic segmentation.
Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free.
We propose an end-to-end framework, SegTAD, composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
arXiv Detail & Related papers (2022-03-03T06:52:13Z)
- Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS$_{17}$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting [22.86745487695168]
We propose a baseline based on multi-instance and multi-label learning.
We propose a novel approach that uses sets of actions as representation instead of modeling individual action classes.
We evaluate the proposed approach on a challenging dataset, where it outperforms the MIML baseline and is competitive with fully supervised approaches.
arXiv Detail & Related papers (2021-01-21T11:59:47Z)
- Few-shot 3D Point Cloud Semantic Segmentation [138.80825169240302]
We propose a novel attention-aware multi-prototype transductive few-shot point cloud semantic segmentation method.
Our proposed method shows significant and consistent improvements compared to baselines in different few-shot point cloud semantic segmentation settings.
arXiv Detail & Related papers (2020-06-22T08:05:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.