Temporal Action Detection with Global Segmentation Mask Learning
- URL: http://arxiv.org/abs/2207.06580v1
- Date: Thu, 14 Jul 2022 00:46:51 GMT
- Title: Temporal Action Detection with Global Segmentation Mask Learning
- Authors: Sauradip Nag, Xiatian Zhu, Yi-Zhe Song and Tao Xiang
- Abstract summary: Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video.
We propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS).
Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length.
- Score: 134.26292288193298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing temporal action detection (TAD) methods rely on generating an
overwhelmingly large number of proposals per video. This leads to complex model
designs due to proposal generation and/or per-proposal action instance
evaluation and the resultant high computational cost. In this work, for the
first time, we propose a proposal-free Temporal Action detection model with
Global Segmentation mask (TAGS). Our core idea is to learn a global
segmentation mask of each action instance jointly at the full video length. The
TAGS model differs significantly from the conventional proposal-based methods
by focusing on global temporal representation learning to directly detect local
start and end points of action instances without proposals. Further, by
modeling TAD holistically rather than locally at the individual proposal level,
TAGS needs a much simpler model architecture with lower computational cost.
Extensive experiments show that despite its simpler design, TAGS outperforms
existing TAD methods, achieving new state-of-the-art performance on two
benchmarks. Importantly, it is ~20x faster to train and ~1.6x more efficient
for inference. Our PyTorch implementation of TAGS is available at
https://github.com/sauradip/TAGS .
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z) - Active Learning with Effective Scoring Functions for Semi-Supervised
Temporal Action Localization [15.031156121516211]
This paper focuses on a rarely investigated yet practical task named semi-supervised TAL.
We propose an effective active learning method, named AL-STAL.
Experiment results show that AL-STAL outperforms existing competitors and achieves satisfactory performance compared with fully-supervised learning.
arXiv Detail & Related papers (2022-08-31T13:39:38Z) - Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE).
Our model significantly outperforms state-of-the-art alternatives.
Our model also yields superior results on supervised TAD over recent strong competitors.
arXiv Detail & Related papers (2022-07-17T13:59:46Z) - Semi-Supervised Temporal Action Detection with Proposal-Free Masking [134.26292288193298]
We propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT).
SPOT outperforms state-of-the-art alternatives, often by a large margin.
arXiv Detail & Related papers (2022-07-14T16:58:47Z) - Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z) - Temporal Action Localization Using Gated Recurrent Units [6.091096843566857]
We propose a new network based on the Gated Recurrent Unit (GRU) and two novel post-processing ideas for the TAL task.
Specifically, we propose a new design for the GRU output layer, resulting in the so-called GRU-Splitted model.
We evaluate the proposed method against state-of-the-art methods.
arXiv Detail & Related papers (2021-08-07T06:25:29Z) - Learning Salient Boundary Feature for Anchor-free Temporal Action
Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z) - UniT: Unified Knowledge Transfer for Any-shot Object Detection and
Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision levels.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)