Faster-TAD: Towards Temporal Action Detection with Proposal Generation and Classification in a Unified Network
- URL: http://arxiv.org/abs/2204.02674v1
- Date: Wed, 6 Apr 2022 08:55:35 GMT
- Title: Faster-TAD: Towards Temporal Action Detection with Proposal Generation and Classification in a Unified Network
- Authors: Shimin Chen, Chen Chen, Wei Li, Xunqiang Tao, Yandong Guo
- Abstract summary: Temporal action detection (TAD) aims to detect the semantic labels and boundaries of action instances in untrimmed videos.
We propose a unified network for TAD, termed Faster-TAD, by re-purposing a Faster-RCNN like architecture.
- Score: 13.03191060554677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action detection (TAD) aims to detect the semantic labels and
boundaries of action instances in untrimmed videos. Current mainstream
approaches are multi-step solutions, which fall short in efficiency and
flexibility. In this paper, we propose a unified network for TAD, termed
Faster-TAD, by re-purposing a Faster-RCNN like architecture. To tackle the
unique difficulty in TAD, we make important improvements over the original
framework. We propose a new Context-Adaptive Proposal Module and an innovative
Fake-Proposal Generation Block. What's more, we use atomic action features to
improve the performance. Faster-TAD simplifies the pipeline of TAD and achieves
remarkable performance on multiple benchmarks, e.g., ActivityNet-1.3 (40.01%
mAP), HACS Segments (38.39% mAP), and SoccerNet-Action Spotting (54.09% mAP). It
outperforms existing single-network detectors by a large margin.
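The mAP figures above are computed by matching predicted segments to ground-truth instances at temporal IoU (tIoU) thresholds. As an illustrative sketch only (not code from this paper), tIoU between two 1-D segments can be computed as:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    # Overlap length; clamped at zero for disjoint segments.
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0

# Example: segments (2, 8) and (4, 10) overlap for 4s out of an 8s union.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```

Average precision is then computed per class at one or more tIoU thresholds (e.g., 0.5:0.05:0.95 on ActivityNet-1.3) and averaged into the reported mAP.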
Related papers
- Temporal Action Detection with Global Segmentation Mask Learning [134.26292288193298]
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video.
We propose a proposal-free Temporal Action detection model with Global mask (TAGS)
Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length.
arXiv Detail & Related papers (2022-07-14T00:46:51Z)
- ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding such as temporal action detection (TAD) often suffers from the pain of huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD)
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images.
To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN.
In this work, we present an anchor-free approach to efficiently tackling this challenging task, by introducing the following dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z)
- End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Decoupled and Memory-Reinforced Networks: Towards Effective Feature Learning for One-Step Person Search [65.51181219410763]
One-step methods have been developed to handle pedestrian detection and identification sub-tasks using a single network.
There are two major challenges in the current one-step approaches.
We propose a decoupled and memory-reinforced network (DMRNet) to overcome these problems.
arXiv Detail & Related papers (2021-02-22T06:19:45Z)
- Relaxed Transformer Decoders for Direct Action Proposal Generation [30.516462193231888]
This paper presents a simple and end-to-end learnable framework (RTD-Net) for direct action proposal generation.
To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR)
Experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net.
arXiv Detail & Related papers (2021-02-03T06:29:28Z)
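Several of the papers above revolve around generating, sparsifying, or filtering temporal action proposals. As an illustrative sketch only (a standard post-processing step in proposal-based TAD, not any listed paper's method), greedy non-maximum suppression over scored temporal segments looks like:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_threshold=0.5):
    """Greedy NMS over scored proposals given as (start, end, score) tuples.

    Keeps the highest-scoring proposal, discards any remaining proposal
    whose tIoU with a kept one reaches the threshold, and repeats.
    """
    kept = []
    for start, end, score in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(temporal_iou((start, end), (ks, ke)) < iou_threshold
               for ks, ke, _ in kept):
            kept.append((start, end, score))
    return kept

# The near-duplicate second proposal is suppressed; the distant one survives.
print(temporal_nms([(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]))
```

Proposal-free approaches such as TAGS above sidestep this step entirely by predicting segmentation masks instead of ranked proposal sets.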
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.