Real-Time and Accurate Object Detection in Compressed Video by Long
Short-term Feature Aggregation
- URL: http://arxiv.org/abs/2103.14529v1
- Date: Thu, 25 Mar 2021 01:38:31 GMT
- Title: Real-Time and Accurate Object Detection in Compressed Video by Long
Short-term Feature Aggregation
- Authors: Xinggang Wang, Zhaojin Huang, Bencheng Liao, Lichao Huang, Yongchao
Gong, Chang Huang
- Abstract summary: Video object detection is studied for pushing the limits of detection speed and accuracy.
To reduce the cost, we sparsely sample key frames in the video and treat the remaining frames as non-key frames.
A large and deep network is used to extract features for key frames and a tiny network is used for non-key frames.
The proposed video object detection network is evaluated on the large-scale ImageNet VID benchmark.
- Score: 30.73836337432833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video object detection is a fundamental problem in computer vision and has a
wide spectrum of applications. Based on deep networks, video object detection
is actively studied for pushing the limits of detection speed and accuracy. To
reduce the computation cost, we sparsely sample key frames in the video and treat
the remaining frames as non-key frames; a large and deep network is used to extract
features for key frames and a tiny network is used for non-key frames. To
enhance the features of non-key frames, we propose a novel short-term feature
aggregation method to propagate the rich information in key frame features to
non-key frame features in a fast way. The fast feature aggregation is enabled
by the freely available motion cues in compressed videos. Further, key frame
features are also aggregated based on optical flow. The propagated deep
features are then integrated with the directly extracted features for object
detection. The feature extraction and feature integration parameters are
optimized in an end-to-end manner. The proposed video object detection network
is evaluated on the large-scale ImageNet VID benchmark and achieves 77.2\% mAP,
which is on-par with state-of-the-art accuracy, at the speed of 30 FPS using a
Titan X GPU. The source codes are available at
\url{https://github.com/hustvl/LSFA}.
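The key/non-key frame pipeline described in the abstract can be sketched as follows. This is a conceptual illustration only, not the authors' LSFA implementation; the network stand-ins, shapes, and the averaging-based feature integration are all assumptions made for the sketch.

```python
# Conceptual sketch of key/non-key frame processing with short-term
# feature aggregation driven by compressed-video motion cues.
# All functions are illustrative placeholders, not the LSFA code.
import numpy as np

def large_network(frame):
    # Stand-in for the deep feature extractor run on key frames.
    return frame.mean(axis=2, keepdims=True).repeat(8, axis=2)

def tiny_network(frame):
    # Stand-in for the lightweight extractor run on non-key frames.
    return frame.mean(axis=2, keepdims=True).repeat(8, axis=2) * 0.5

def warp_with_motion(features, motion_vectors):
    # Propagate key-frame features using motion vectors that are freely
    # available in the compressed bitstream, so no optical flow needs to
    # be computed for non-key frames.
    h, w, _ = features.shape
    warped = np.empty_like(features)
    for y in range(h):
        for x in range(w):
            dy, dx = motion_vectors[y, x]
            sy = np.clip(y + dy, 0, h - 1)
            sx = np.clip(x + dx, 0, w - 1)
            warped[y, x] = features[sy, sx]
    return warped

def detect_features(frames, motion, key_interval=10):
    """Sparsely sample key frames; aggregate features for the rest."""
    key_feat = None
    out = []
    for t, frame in enumerate(frames):
        if t % key_interval == 0:
            key_feat = large_network(frame)   # expensive path, sparse
            out.append(key_feat)
        else:
            shallow = tiny_network(frame)     # cheap path, every frame
            propagated = warp_with_motion(key_feat, motion[t])
            # Integrate propagated deep features with the directly
            # extracted shallow features (here: a simple average).
            out.append(0.5 * (propagated + shallow))
    return out
```

In the paper the integration parameters are learned end-to-end rather than fixed as above; the sketch only shows the control flow that makes the 30 FPS speed possible.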
Related papers
- Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Handling temporal information is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain quality spatio-temporal information.
We present a neat and unified framework called the Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z)
- A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos [107.96514633713034]
We propose a spatial-temporal deformable attention based framework, named STNet.
Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion.
Experiments on the public breast lesion ultrasound video dataset show that our STNet obtains a state-of-the-art detection performance.
arXiv Detail & Related papers (2023-09-09T07:00:10Z)
- Key Frame Extraction with Attention Based Deep Neural Networks [0.0]
We propose a deep learning-based approach for key frame detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts the features from the video frames using the encoder part of the autoencoder and applies segmentation using the k-means algorithm to group these features and similar frames together.
The method was evaluated on the TVSUM clustering video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
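The pipeline summarized above (encode each frame, cluster the features with k-means, keep a representative frame per cluster) can be sketched as follows. The encoder here is a trivial placeholder, not the paper's attention-based auto-encoder, and all names are illustrative assumptions:

```python
# Illustrative sketch of key-frame selection by clustering frame
# features with k-means; the encoder is a placeholder, not the
# paper's attention auto-encoder.
import numpy as np

def encode(frame):
    # Placeholder encoder: flatten and subsample the frame.
    return frame.reshape(-1)[::4].astype(float)

def kmeans(X, k, iters=20, seed=0):
    # Minimal Lloyd's algorithm: assign points to the nearest
    # centre, then move each centre to its cluster mean.
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

def key_frames(frames, k):
    X = np.stack([encode(f) for f in frames])
    _, centres = kmeans(X, k)
    # Keep the frame closest to each cluster centre as a key frame.
    idx = [int(np.argmin(((X - c) ** 2).sum(-1))) for c in centres]
    return sorted(set(idx))
```

With well-separated frame groups this returns one representative index per cluster, which is the grouping behaviour the summary describes.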
arXiv Detail & Related papers (2023-06-21T15:09:37Z)
- VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have made decent progress, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Single Shot Video Object Detector [215.06904478667337]
Single Shot Video Object Detector (SSVD) is a new architecture that novelly integrates feature aggregation into a one-stage detector for object detection in videos.
For $448 \times 448$ input, SSVD achieves 79.2% mAP on the ImageNet VID dataset.
arXiv Detail & Related papers (2020-07-07T15:36:26Z)
- Plug & Play Convolutional Regression Tracker for Video Object Detection [37.47222104272429]
Video object detection aims to simultaneously localize the bounding boxes of objects and identify their classes in a given video.
One challenge for video object detection is to consistently detect all objects across the whole video.
We propose a Plug & Play scale-adaptive convolutional regression tracker for the video object detection task.
arXiv Detail & Related papers (2020-03-02T15:57:55Z)
- Pack and Detect: Fast Object Detection in Videos Using Region-of-Interest Packing [15.162117090697006]
We propose Pack and Detect, an approach to reduce the computational requirements of object detection in videos.
Experiments using the ImageNet video object detection dataset indicate that PaD can potentially reduce the number of FLOPS required for a frame by $4\times$.
arXiv Detail & Related papers (2018-09-05T19:29:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.