An Empirical Study of End-to-End Temporal Action Detection
- URL: http://arxiv.org/abs/2204.02932v1
- Date: Wed, 6 Apr 2022 16:46:30 GMT
- Title: An Empirical Study of End-to-End Temporal Action Detection
- Authors: Xiaolong Liu, Song Bai, Xiang Bai
- Abstract summary: Temporal action detection (TAD) is an important yet challenging task in video understanding.
Rather than end-to-end learning, most existing methods adopt a head-only learning paradigm.
We validate the advantage of end-to-end learning over head-only learning and observe up to 11% performance improvement.
- Score: 82.64373812690127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action detection (TAD) is an important yet challenging task in video
understanding. It aims to simultaneously predict the semantic label and the
temporal interval of every action instance in an untrimmed video. Rather than
end-to-end learning, most existing methods adopt a head-only learning paradigm,
where the video encoder is pre-trained for action classification, and only the
detection head upon the encoder is optimized for TAD. The effect of end-to-end
learning has not been systematically evaluated, and an in-depth study of the
efficiency-accuracy trade-off in end-to-end TAD is still lacking. In this paper, we
present an empirical study of end-to-end temporal action detection. We validate
the advantage of end-to-end learning over head-only learning and observe up to
11% performance improvement. We further study the effects of multiple design
choices that affect the TAD performance and speed, including detection head,
video encoder, and resolution of input videos. Based on the findings, we build
a mid-resolution baseline detector, which achieves the state-of-the-art
performance of end-to-end methods while running more than 4x faster. We
hope that this paper can serve as a guide for end-to-end learning and inspire
future research in this field. Code and models are available at
https://github.com/xlliu7/E2E-TAD.
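To make the head-only vs. end-to-end distinction concrete, the following is a minimal PyTorch-style sketch. It is not the authors' E2E-TAD code; the encoder, detection head, and input shapes are placeholder assumptions. The only difference between the two paradigms in this sketch is whether gradients flow into the video encoder.

```python
# Minimal sketch contrasting head-only and end-to-end training; this is NOT
# the authors' E2E-TAD code. The encoder, detection head, and input shapes
# below are placeholder assumptions chosen only to make the example runnable.
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Stand-in for a clip-level video backbone (e.g. a small 3D CNN)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the temporal axis, pool space
        )
        self.proj = nn.Conv1d(64, feat_dim, kernel_size=1)

    def forward(self, video):                    # video: (B, 3, T, H, W)
        x = self.net(video)                      # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1)            # (B, 64, T)
        return self.proj(x)                      # (B, feat_dim, T)


class DetectionHead(nn.Module):
    """Stand-in anchor-free head: per-frame class scores plus start/end offsets."""
    def __init__(self, feat_dim=256, num_classes=20):
        super().__init__()
        self.cls = nn.Conv1d(feat_dim, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(feat_dim, 2, kernel_size=3, padding=1)

    def forward(self, feats):                    # feats: (B, feat_dim, T)
        return self.cls(feats), self.reg(feats)


def build_optimizer(encoder, head, end_to_end, lr=1e-4):
    """Head-only: freeze the pre-trained encoder and optimize the head alone.
    End-to-end: optimize encoder and head jointly (more memory and compute)."""
    if end_to_end:
        params = list(encoder.parameters()) + list(head.parameters())
    else:
        for p in encoder.parameters():
            p.requires_grad_(False)
        encoder.eval()
        params = list(head.parameters())
    return torch.optim.AdamW(params, lr=lr)


def training_step(encoder, head, optimizer, video, frame_labels, end_to_end):
    # Head-only pipelines usually train on pre-extracted features; disabling
    # gradients through the encoder is the equivalent shortcut in this sketch.
    if end_to_end:
        feats = encoder(video)
    else:
        with torch.no_grad():
            feats = encoder(video)
    cls_logits, _ = head(feats)                  # (B, num_classes, T)
    loss = nn.functional.cross_entropy(cls_logits, frame_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    enc, head = VideoEncoder(), DetectionHead(num_classes=20)
    opt = build_optimizer(enc, head, end_to_end=True)
    video = torch.randn(2, 3, 16, 96, 96)        # low spatial resolution for the sketch
    labels = torch.randint(0, 20, (2, 16))       # per-frame action labels
    print(training_step(enc, head, opt, video, labels, end_to_end=True))
```

In head-only mode (end_to_end=False) the encoder behaves like a frozen, pre-trained feature extractor, which is what makes pre-extracted features possible; end-to-end mode trades memory and speed for the accuracy gains reported in the abstract.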
Related papers
- Deep Learning for Video Anomaly Detection: A Review [52.74513211976795] (2024-09-09)
Video anomaly detection (VAD) aims to discover behaviors or events that deviate from normality in videos.
In the era of deep learning, a great variety of deep learning based methods are constantly emerging for the VAD task.
This review covers the spectrum of five different categories, namely, semi-supervised, weakly supervised, fully supervised, unsupervised and open-set supervised VAD.
- Early Action Recognition with Action Prototypes [62.826125870298306] (2023-12-11)
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the features from all clips in an online fashion to produce the final class prediction.
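The clip-wise encoding and online aggregation described above can be illustrated with a generic sketch; the GRU aggregator, feature sizes, and the omission of the paper's prototype matching are all assumptions, not the cited model.

```python
# Generic sketch of clip-wise encoding with online aggregation for early
# recognition; encoder, GRU aggregator, and shapes are illustrative
# assumptions, not the model from the cited paper.
import torch
import torch.nn as nn

class OnlineClipAggregator(nn.Module):
    def __init__(self, clip_feat_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        self.clip_encoder = nn.Linear(clip_feat_dim, hidden_dim)  # stand-in visual encoder
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)              # online aggregation state
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):
        """clips: (num_clips, B, clip_feat_dim), processed one clip at a time."""
        batch_size = clips.shape[1]
        state = clips.new_zeros(batch_size, self.gru.hidden_size)
        predictions = []
        for clip in clips:                               # clips arrive in temporal order
            feat = torch.relu(self.clip_encoder(clip))
            state = self.gru(feat, state)                # update running summary of the video
            predictions.append(self.classifier(state))   # class scores given clips seen so far
        return torch.stack(predictions)

# Example: a prediction is available after every clip, which is what
# enables evaluation at partial observation ratios (early recognition).
model = OnlineClipAggregator()
clips = torch.randn(8, 2, 512)                           # 8 clips, batch of 2
print(model(clips).shape)                                # torch.Size([8, 2, 10])
```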
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple end-to-end approach that effectively utilizes unlabeled data.
Video action detection requires both action class prediction and spatio-temporal localization of actions.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
- Privileged Knowledge Distillation for Online Action Detection [114.5213840651675] (2020-11-18)
Online Action Detection (OAD) in videos is formulated as a per-frame labeling task to support real-time prediction.
This paper presents a novel learning-with-privileged-information framework for online action detection, where future frames, observable only at training time, are treated as a form of privileged information.
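As an illustration of the general learning-with-privileged-information idea, here is a hedged sketch in which a teacher that sees future frames at training time guides a causal student that sees only current and past frames; the module names, losses, and shapes are assumptions and not the cited paper's implementation.

```python
# Hedged sketch of privileged-information distillation for online detection:
# the teacher uses a bidirectional recurrence (i.e. future frames), the
# deployed student is causal. Names, losses, and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameLabeler(nn.Module):
    """Per-frame action classifier over a sequence of frame features."""
    def __init__(self, feat_dim=128, num_classes=10, causal=True):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True,
                          bidirectional=not causal)     # teacher may look ahead
        out_dim = feat_dim if causal else 2 * feat_dim
        self.fc = nn.Linear(out_dim, num_classes)

    def forward(self, feats):                # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        return self.fc(h)                    # per-frame logits: (B, T, num_classes)

def distillation_step(student, teacher, feats, labels, temperature=2.0, alpha=0.5):
    """Student loss = supervised CE + KL divergence to the privileged teacher."""
    with torch.no_grad():                    # teacher sees future frames, frozen here
        teacher_logits = teacher(feats)
    student_logits = student(feats)
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean")
    return alpha * ce + (1 - alpha) * kd

student = FrameLabeler(causal=True)          # deployed online, no future frames needed
teacher = FrameLabeler(causal=False)         # training-only, sees the whole clip
feats = torch.randn(2, 30, 128)              # pre-extracted per-frame features
labels = torch.randint(0, 10, (2, 30))
print(distillation_step(student, teacher, feats, labels).item())
```

At inference time only the causal student runs, so no future frames are required.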
- Self-supervised Temporal Discriminative Learning for Video Representation Learning [39.43942923911425] (2020-08-05)
Temporal-discriminative features can hardly be extracted without using an annotated large-scale video action dataset for training.
This paper proposes a novel Video-based Temporal-Discriminative Learning framework that works in a self-supervised manner.
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382] (2020-03-12)
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
- Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263] (2019-06-15)
Action anticipation aims to recognize an action from a partial observation.
We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification.
We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.