Efficient One-stage Video Object Detection by Exploiting Temporal
Consistency
- URL: http://arxiv.org/abs/2402.09241v1
- Date: Wed, 14 Feb 2024 15:32:07 GMT
- Title: Efficient One-stage Video Object Detection by Exploiting Temporal
Consistency
- Authors: Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson
- Abstract summary: One-stage detectors have achieved competitive accuracy and faster speed compared with traditional two-stage detectors on image data.
In this paper, we first analyse the computational bottlenecks of using one-stage detectors for video object detection.
We present a simple yet efficient framework to address the computational bottlenecks and achieve efficient one-stage VOD.
- Score: 35.16197118579414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, one-stage detectors have achieved competitive accuracy and faster
speed compared with traditional two-stage detectors on image data. However, in
the field of video object detection (VOD), most existing VOD methods are still
based on two-stage detectors. Moreover, directly adapting existing VOD methods
to one-stage detectors introduces unaffordable computational costs. In this
paper, we first analyse the computational bottlenecks of using one-stage
detectors for VOD. Based on the analysis, we present a simple yet efficient
framework to address the computational bottlenecks and achieve efficient
one-stage VOD by exploiting the temporal consistency in video frames.
Specifically, our method consists of a location-prior network to filter out
background regions and a size-prior network to skip unnecessary computations on
low-level feature maps for specific frames. We test our method on various
modern one-stage detectors and conduct extensive experiments on the ImageNet
VID dataset. Excellent experimental results demonstrate the superior
effectiveness, efficiency, and compatibility of our method. The code is
available at https://github.com/guanxiongsun/vfe.pytorch.
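The two priors described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses learned location-prior and size-prior networks, whereas the code below substitutes hand-written heuristics that reuse the previous frame's boxes. The function names `location_mask` and `levels_to_run`, the `margin` value, and the size thresholds in `SIZE_RANGES` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Hypothetical mapping from feature-map stride to the object side lengths
# (in pixels) that the level is assumed to handle. Not taken from the paper.
SIZE_RANGES: Dict[int, Tuple[float, float]] = {
    8: (0.0, 64.0),
    16: (64.0, 128.0),
    32: (128.0, float("inf")),
}


def location_mask(prev_boxes: List[Box], img_w: int, img_h: int,
                  stride: int = 8, margin: float = 16.0) -> List[List[bool]]:
    """Mark feature-map cells that lie near any previous-frame box.

    Cells left as False are treated as background and could be skipped
    by the detection head on the current frame.
    """
    cols = (img_w + stride - 1) // stride
    rows = (img_h + stride - 1) // stride
    mask = [[False] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in prev_boxes:
        # Grow the box by a margin to tolerate motion between frames.
        c0 = max(0, int((x1 - margin) // stride))
        r0 = max(0, int((y1 - margin) // stride))
        c1 = min(cols - 1, int((x2 + margin) // stride))
        r1 = min(rows - 1, int((y2 + margin) // stride))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                mask[r][c] = True
    return mask


def levels_to_run(prev_boxes: List[Box],
                  size_ranges: Dict[int, Tuple[float, float]] = SIZE_RANGES
                  ) -> List[int]:
    """Return the strides whose size range contains a previous-frame box."""
    if not prev_boxes:
        # No prior information: fall back to running every level.
        return sorted(size_ranges)
    needed = set()
    for x1, y1, x2, y2 in prev_boxes:
        side = max(x2 - x1, y2 - y1)
        for stride, (lo, hi) in size_ranges.items():
            if lo <= side < hi:
                needed.add(stride)
    return sorted(needed)


if __name__ == "__main__":
    boxes = [(100.0, 100.0, 300.0, 260.0)]
    mask = location_mask(boxes, img_w=640, img_h=480)
    active = sum(sum(row) for row in mask)
    total = len(mask) * len(mask[0])
    print(f"active cells: {active}/{total}")
    print(f"levels to run: {levels_to_run(boxes)}")
```

In this toy setting a single large object leaves most stride-8 cells marked as background and removes the need for the two highest-resolution levels, which is the kind of saving from temporal consistency that the abstract describes. A real system would also need a policy for refreshing the priors on frames where new objects may appear.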
Related papers
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must contend with high across-frame variation in object appearance and diverse deterioration in some frames.
Most contemporary aggregation methods are tailored for two-stage detectors and suffer from high computational costs.
This study proposes a very simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- TempNet: Temporal Attention Towards the Detection of Animal Behaviour in Videos [63.85815474157357]
We propose an efficient computer vision- and deep learning-based method for the detection of biological behaviours in videos.
TempNet uses an encoder bridge and residual blocks to maintain model performance with a two-staged, spatial, then temporal, encoder.
We demonstrate its application to the detection of sablefish (Anoplopoma fimbria) startle events.
arXiv Detail & Related papers (2022-11-17T23:55:12Z)
- YOLOV: Making Still Image Object Detectors Great at Video Object Detection [23.039968987772543]
Video object detection (VID) is challenging because of the high variation of object appearance and the diverse deterioration in some frames.
This work proposes a simple yet effective strategy to address the concerns, which spends marginal overheads with significant gains in accuracy.
Our YOLOX-based model can achieve promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU).
arXiv Detail & Related papers (2022-08-20T14:12:06Z)
- NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves a state-of-the-art accuracy-efficiency trade-off and delivers significantly faster (2.4~4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding, such as temporal action detection (TAD), often suffers from a huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Motion Vector Extrapolation for Video Object Detection [0.0]
MOVEX enables low-latency video object detection on common CPU-based systems.
We show that our approach significantly reduces the baseline latency of any given object detector.
Further latency reduction, up to 25x lower than the original latency, can be achieved with minimal accuracy loss.
arXiv Detail & Related papers (2021-04-18T17:26:37Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing the convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model improves accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21, and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.