CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action
Localization
- URL: http://arxiv.org/abs/2008.08332v1
- Date: Wed, 19 Aug 2020 08:47:50 GMT
- Title: CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action
Localization
- Authors: Yuxi Li, Weiyao Lin, John See, Ning Xu, Shugong Xu, Ke Yan and Cong
Yang
- Abstract summary: We propose Coarse-to-Fine Action Detector (CFAD) for efficient action localization.
CFAD first estimates coarse spatio-temporal action tubes from video streams, and then refines the tube locations based on key timestamps.
- Score: 42.95186231216036
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Most current pipelines for spatio-temporal action localization connect
frame-wise or clip-wise detection results to generate action proposals, where
only local information is exploited and the efficiency is hindered by dense
per-frame localization. In this paper, we propose Coarse-to-Fine Action
Detector (CFAD), an original end-to-end trainable framework for efficient
spatio-temporal action localization. The CFAD introduces a new paradigm that
first estimates coarse spatio-temporal action tubes from video streams, and
then refines the tubes' location based on key timestamps. This concept is
implemented by two key components, the Coarse and Refine Modules in our
framework. The parameterized modeling of long temporal information in the
Coarse Module helps obtain accurate initial tube estimation, while the Refine
Module selectively adjusts the tube location under the guidance of key
timestamps. Compared with other methods, the proposed CFAD achieves competitive
results on the action detection benchmarks UCF101-24, UCFSports and JHMDB-21,
with an inference speed that is 3.3x faster than the nearest competitor.
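The abstract sketches a two-stage pipeline: a Coarse Module that regresses an initial tube from long-range clip features, followed by a Refine Module that adjusts boxes only at selected key timestamps. The PyTorch sketch below is a minimal, hypothetical rendering of that coarse-to-fine idea; the module names, feature shapes, and the fixed key-timestamp indices are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the coarse-to-fine idea from the abstract.
# All names, shapes, and the key-frame selection are illustrative
# assumptions; they do not reproduce the authors' actual CFAD modules.
import torch
import torch.nn as nn

class CoarseModule(nn.Module):
    """Regresses an initial action tube (one box per frame) from a pooled clip feature."""
    def __init__(self, feat_dim: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        # Parameterized temporal modeling collapsed into a single linear head here.
        self.tube_head = nn.Linear(feat_dim, num_frames * 4)

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        # clip_feat: (B, feat_dim), pooled over space and time
        boxes = self.tube_head(clip_feat)              # (B, T*4)
        return boxes.view(-1, self.num_frames, 4)      # coarse tube: (B, T, 4)

class RefineModule(nn.Module):
    """Adjusts the coarse tube only at key timestamps, leaving other frames untouched."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.offset_head = nn.Linear(feat_dim, 4)      # per-key-frame box offset

    def forward(self, tube, frame_feats, key_idx):
        # tube: (B, T, 4); frame_feats: (B, T, feat_dim); key_idx: (K,) frame indices
        refined = tube.clone()
        offsets = self.offset_head(frame_feats[:, key_idx])  # (B, K, 4)
        refined[:, key_idx] = tube[:, key_idx] + offsets     # sparse refinement
        return refined

# Usage: one coarse pass over the clip, then refinement at a few key frames only,
# which is where the efficiency gain over dense per-frame localization comes from.
B, T, D = 2, 16, 256
coarse, refine = CoarseModule(D, T), RefineModule(D)
tube = coarse(torch.randn(B, D))                       # initial tube estimate
tube = refine(tube, torch.randn(B, T, D), torch.tensor([0, 7, 15]))
print(tube.shape)  # torch.Size([2, 16, 4])
```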
Related papers
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- An Efficient Spatio-Temporal Pyramid Transformer for Action Detection [40.68615998427292]
We present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) video framework for action detection.
Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention to capture long-term space-time dependencies in the later stages.
In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2022-07-21T12:38:05Z)
- Exploring Stronger Feature for Temporal Action Localization [41.23726979184197]
Temporal action localization aims to localize the start and end times of actions together with their categories.
We explored classic convolution-based backbones and the recent surge of transformer-based backbones.
We achieve 42.42% mAP on the validation set with a single SlowFast feature through a simple combination.
arXiv Detail & Related papers (2021-06-24T13:46:30Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
- Revisiting Anchor Mechanisms for Temporal Action Localization [126.96340233561418]
This paper proposes a novel anchor-free action localization module that assists action localization by temporal points.
By combining the proposed anchor-free module with a conventional anchor-based module, we propose a novel action localization framework, called A2Net.
The cooperation between anchor-free and anchor-based modules achieves superior performance to the state-of-the-art on THUMOS14.
arXiv Detail & Related papers (2020-08-22T13:39:29Z)
- Actions as Moving Points [66.21507857877756]
We present a conceptually simple, efficient, and more precise action tubelet detection framework, termed MovingCenter Detector (MOC-detector).
Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches.
Our MOC-detector outperforms the existing state-of-the-art methods for both metrics of frame-mAP and video-mAP on the JHMDB and UCF101-24 datasets.
arXiv Detail & Related papers (2020-01-14T03:29:44Z)