Temporal Action Localization with Enhanced Instant Discriminability
- URL: http://arxiv.org/abs/2309.05590v1
- Date: Mon, 11 Sep 2023 16:17:50 GMT
- Title: Temporal Action Localization with Enhanced Instant Discriminability
- Authors: Dingfeng Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang
Zhu, Dacheng Tao
- Abstract summary: Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve the imprecise boundary predictions of existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action detection (TAD) aims to detect all action boundaries and
their corresponding categories in an untrimmed video. The unclear boundaries of
actions in videos often result in imprecise predictions of action boundaries by
existing methods. To resolve this issue, we propose a one-stage framework named
TriDet. First, we propose a Trident-head to model the action boundary via an
estimated relative probability distribution around the boundary. Then, we
analyze the rank-loss problem (i.e., instant discriminability deterioration) in
transformer-based methods and propose an efficient scalable-granularity
perception (SGP) layer to mitigate this issue. To further push the limit of
instant discriminability in the video backbone, we leverage the strong
representation capability of pretrained large models and investigate their
performance on TAD. Finally, to preserve adequate spatial-temporal context for
classification, we design a decoupled feature pyramid network with separate
feature pyramids, which incorporates the rich spatial context of the large model
for localization. Experimental results demonstrate the robustness of TriDet and its
state-of-the-art performance on multiple TAD datasets, including hierarchical
(multilabel) TAD datasets.
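The core Trident-head idea, estimating a boundary through the expectation of a relative probability distribution over nearby instants, can be illustrated with a minimal sketch. This is an assumption-laden illustration only (the bin count, function name, and shapes are ours, not the paper's code, and the actual head also combines start, end, and center branches):

```python
import numpy as np

def expected_boundary_offset(bin_logits):
    """Turn logits over B relative-offset bins into one soft boundary offset
    via the distribution's expectation (hypothetical sketch of the
    relative-distribution idea; names and shapes are assumptions)."""
    # numerically stable softmax over the last axis (the B bins)
    shifted = bin_logits - bin_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    # candidate relative offsets 0, 1, ..., B-1 (in instants)
    bins = np.arange(bin_logits.shape[-1])
    # the expectation yields a sub-bin, differentiable offset estimate
    return (probs * bins).sum(axis=-1)

# toy check: logits peaked at bin 3 yield an offset close to 3
logits = np.array([0.0, 0.0, 1.0, 4.0, 1.0])
offset = expected_boundary_offset(logits)
```

Because the offset is a soft expectation rather than a hard argmax, it can fall between bins and remains differentiable, which is what makes this style of boundary modeling trainable end to end.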
Related papers
- Boundary-Aware Proposal Generation Method for Temporal Action Localization
TAL aims to find the categories and temporal boundaries of actions in an untrimmed video.
Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries.
We propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning.
arXiv Detail & Related papers (2023-09-25T01:41:09Z)
- TriDet: Temporal Action Detection with Relative Boundary Modeling
Existing methods often suffer from imprecise boundary predictions due to ambiguous action boundaries in videos.
We propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary.
TriDet achieves state-of-the-art performance on three challenging benchmarks.
arXiv Detail & Related papers (2023-03-13T17:59:59Z)
- Semi-Supervised Temporal Action Detection with Proposal-Free Masking
We propose a novel Semi-supervised Temporal action detection model based on a PropOsal-free Temporal mask (SPOT).
SPOT outperforms state-of-the-art alternatives, often by a large margin.
arXiv Detail & Related papers (2022-07-14T16:58:47Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models, which take image or patch features as input, LocATe's transformer model captures long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
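High tIoU thresholds reward predictions whose boundaries overlap the ground truth almost exactly. Temporal IoU is the standard overlap measure between two (start, end) segments; a minimal sketch (function name is ours):

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# overlap of 5s over a union of 15s
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

Detection benchmarks typically report mAP averaged over a range of tIoU thresholds (e.g. 0.3 to 0.7), so methods that only roughly localize actions degrade sharply at the high end of that range.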
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Boundary-sensitive Pre-training for Temporal Localization in Videos
We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be carried out simply by classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart.
arXiv Detail & Related papers (2020-11-21T17:46:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.