ETAD: A Unified Framework for Efficient Temporal Action Detection
- URL: http://arxiv.org/abs/2205.07134v1
- Date: Sat, 14 May 2022 21:16:21 GMT
- Title: ETAD: A Unified Framework for Efficient Temporal Action Detection
- Authors: Shuming Liu, Mengmeng Xu, Chen Zhao, Xu Zhao, Bernard Ghanem
- Abstract summary: Untrimmed video understanding tasks such as temporal action detection (TAD) often suffer from a huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
- Score: 70.21104995731085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Untrimmed video understanding tasks such as temporal action
detection (TAD) often suffer from a huge demand for computing resources. Because of long
video durations and limited GPU memory, most action detectors can only operate
on pre-extracted features rather than the original videos, and they still
require a lot of computation to achieve high detection performance. To
alleviate the heavy computation problem in TAD, in this work, we first propose
an efficient action detector with proposal sampling, based on the
observation that performance saturates at a small number of proposals. This
detector is designed with several important techniques, such as LSTM-boosted
temporal aggregation and cascaded proposal refinement to achieve high detection
quality as well as low computational cost. To enable joint optimization of this
action detector and the feature encoder, we also propose encoder gradient
sampling, which selectively back-propagates through video snippets and
tremendously reduces GPU memory consumption. With the two sampling strategies
and the effective detector, we build a unified framework for efficient
end-to-end temporal action detection (ETAD), making real-world untrimmed video
understanding tractable. ETAD achieves state-of-the-art performance on both
THUMOS-14 and ActivityNet-1.3. Interestingly, on ActivityNet-1.3, it reaches
37.78% average mAP, while requiring only 6 minutes of training time and 1.23 GB
of memory based on pre-extracted features. With end-to-end training, it reduces
the GPU memory footprint by more than 70% with even higher performance (38.21%
average mAP), as compared with traditional end-to-end methods. The code is
available at https://github.com/sming256/ETAD.
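To make the first sampling strategy concrete, here is a minimal PyTorch sketch of proposal sampling: since detection performance is observed to saturate at a small number of proposals, the cascaded refinement can be trained on a random subset of candidates. The function name and the `num_keep` default are illustrative assumptions, not the authors' API.

```python
import torch

def sample_proposals(proposals: torch.Tensor, num_keep: int = 64) -> torch.Tensor:
    """Randomly keep a small subset of candidate proposals for refinement.

    proposals: (N, 2) tensor of (start, end) candidates. Performance saturates
    at a small number of proposals, so training on a sample is sufficient.
    """
    n = proposals.shape[0]
    if n <= num_keep:
        return proposals
    idx = torch.randperm(n)[:num_keep]
    return proposals[idx]
```

Encoder gradient sampling can be sketched in the same hedged spirit: every snippet is encoded in the forward pass, but activations are stored (and gradients later computed) for only a sampled fraction, which is what cuts GPU memory during end-to-end training. The tensor layout and the `grad_ratio` knob below are assumptions for illustration; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn

def encode_with_gradient_sampling(encoder: nn.Module,
                                  snippets: torch.Tensor,
                                  grad_ratio: float = 0.3) -> torch.Tensor:
    """Encode T snippets, back-propagating through only a sampled subset.

    snippets: (T, C, L, H, W) tensor of T short clips from one untrimmed video.
    grad_ratio: fraction of snippets that keep gradients (hypothetical knob).
    """
    num_snippets = snippets.shape[0]
    num_grad = max(1, int(num_snippets * grad_ratio))
    grad_mask = torch.zeros(num_snippets, dtype=torch.bool)
    grad_mask[torch.randperm(num_snippets)[:num_grad]] = True

    features = []
    for t in range(num_snippets):
        clip = snippets[t : t + 1]        # keep the batch dimension
        if grad_mask[t]:
            feat = encoder(clip)          # activations kept for backward
        else:
            with torch.no_grad():         # no activations stored -> saves memory
                feat = encoder(clip)
        features.append(feat)
    return torch.cat(features, dim=0)     # (T, D) sequence for the detector
```

All snippet features still reach the detector, so the detection loss is computed over the full video; only the encoder update is restricted to the sampled snippets.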
Related papers
- Intelligent Video Recording Optimization using Activity Detection for Surveillance Systems [0.0]
This paper proposes an optimized video recording solution focused on activity detection.
The proposed approach utilizes a hybrid method that combines motion detection via frame subtraction with object detection using YOLOv9.
The developed model achieves a precision of 0.855 for car detection and 0.884 for person detection.
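As a rough illustration of the frame-subtraction half of such a hybrid pipeline, the sketch below gates recording on simple frame differencing with OpenCV; the thresholds and the function name are hypothetical, and the object-detection stage (YOLOv9 in the paper) would run only on frames that pass this cheap gate.

```python
import cv2

def motion_gate(prev_gray, frame, thresh: int = 25, min_area: int = 500) -> bool:
    """Return True when frame differencing suggests activity worth recording.

    prev_gray: previous frame, already grayscale and blurred by the caller.
    frame: current BGR frame from the camera.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    diff = cv2.absdiff(prev_gray, gray)            # pixel-wise frame subtraction
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)    # fill small holes in the motion mask
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return any(cv2.contourArea(c) >= min_area for c in contours)
```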
arXiv Detail & Related papers (2024-11-04T21:44:03Z)
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must contend with high across-frame variation in object appearance and diverse deterioration in some frames.
Most contemporary aggregation methods are tailored for two-stage detectors and suffer from high computational costs.
This study presents a simple yet potent feature selection and aggregation strategy that gains significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
- Efficient One-stage Video Object Detection by Exploiting Temporal Consistency [35.16197118579414]
One-stage detectors have achieved competitive accuracy and faster speed compared with traditional two-stage detectors on image data.
In this paper, we first analyse the computational bottlenecks of using one-stage detectors for video object detection.
We present a simple yet efficient framework to address the computational bottlenecks and achieve efficient one-stage VOD.
arXiv Detail & Related papers (2024-02-14T15:32:07Z)
- TempNet: Temporal Attention Towards the Detection of Animal Behaviour in Videos [63.85815474157357]
We propose an efficient computer vision- and deep learning-based method for the detection of biological behaviours in videos.
TempNet uses an encoder bridge and residual blocks to maintain model performance with a two-stage, spatial-then-temporal encoder.
We demonstrate its application to the detection of sablefish (Anoplopoma fimbria) startle events.
arXiv Detail & Related papers (2022-11-17T23:55:12Z)
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z)
- SALISA: Saliency-based Input Sampling for Efficient Video Object Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
arXiv Detail & Related papers (2022-04-05T17:59:51Z)
- Motion Vector Extrapolation for Video Object Detection [0.0]
MOVEX enables low-latency video object detection on common CPU-based systems.
We show that our approach significantly reduces the baseline latency of any given object detector.
Further latency reduction, up to 25x lower than the original latency, can be achieved with minimal accuracy loss.
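The general idea admits a short sketch: run the heavy detector only on keyframes, then shift the stale boxes by the motion observed between frames. MOVEX reads motion vectors from the compressed stream; the sketch below substitutes dense optical flow as a stand-in, so it illustrates the extrapolation step rather than the paper's implementation, and the function name is hypothetical.

```python
import cv2
import numpy as np

def extrapolate_boxes(prev_gray: np.ndarray,
                      gray: np.ndarray,
                      boxes: np.ndarray) -> np.ndarray:
    """Shift last-keyframe boxes by the mean motion inside each box.

    boxes: (N, 4) array of [x1, y1, x2, y2] from the most recent detector run.
    Dense Farneback optical flow stands in for codec motion vectors here.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    out = boxes.astype(np.float32).copy()
    for i, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
        patch = flow[y1:y2, x1:x2]                # flow vectors under the box
        if patch.size == 0:
            continue
        dx, dy = float(patch[..., 0].mean()), float(patch[..., 1].mean())
        out[i] += [dx, dy, dx, dy]                # translate the box rigidly
    return out
```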
arXiv Detail & Related papers (2021-04-18T17:26:37Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
- SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector.
It forms a single shot anchor-based detector (SADet) for efficient and accurate pedestrian detection.
Though structurally simple, it achieves state-of-the-art results and a real-time speed of 20 FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)
- Joint Detection and Tracking in Videos with Identification Features [36.55599286568541]
We propose the first joint optimization of detection, tracking and re-identification features for videos.
Our method reaches the state-of-the-art on MOT: it ranks 1st in the UA-DETRAC'18 tracking challenge among online trackers and 3rd overall.
arXiv Detail & Related papers (2020-05-21T21:06:40Z)