Single Shot Video Object Detector
- URL: http://arxiv.org/abs/2007.03560v1
- Date: Tue, 7 Jul 2020 15:36:26 GMT
- Title: Single Shot Video Object Detector
- Authors: Jiajun Deng and Yingwei Pan and Ting Yao and Wengang Zhou and Houqiang
Li and Tao Mei
- Abstract summary: Single Shot Video Object Detector (SSVD) is a new architecture that integrates feature aggregation into a one-stage detector for object detection in videos.
For $448 \times 448$ input, SSVD achieves 79.2% mAP on the ImageNet VID dataset.
- Score: 215.06904478667337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single shot detectors that are potentially faster and simpler than two-stage
detectors tend to be more applicable to object detection in videos.
Nevertheless, the extension of such object detectors from image to video is not
trivial, especially when appearance deterioration exists in videos, e.g.,
motion blur or occlusion. A valid question is how to explore temporal coherence
across frames for boosting detection. In this paper, we propose to address the
problem by enhancing per-frame features through aggregation of neighboring
frames. Specifically, we present Single Shot Video Object Detector (SSVD) -- a
new architecture that novelly integrates feature aggregation into a one-stage
detector for object detection in videos. Technically, SSVD takes Feature
Pyramid Network (FPN) as backbone network to produce multi-scale features.
Unlike the existing feature aggregation methods, SSVD, on one hand, estimates
the motion and aggregates the nearby features along the motion path, and on the
other, hallucinates features by directly sampling features from the adjacent
frames in a two-stream structure. Extensive experiments are conducted on
ImageNet VID dataset, and competitive results are reported when comparing to
state-of-the-art approaches. More remarkably, for $448 \times 448$ input, SSVD
achieves 79.2% mAP on ImageNet VID, by processing one frame in 85 ms on an
Nvidia Titan X Pascal GPU. The code is available at
https://github.com/ddjiajun/SSVD.
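The abstract describes enhancing per-frame features by aggregating features from neighboring frames. The sketch below illustrates the general idea of similarity-weighted temporal feature aggregation in NumPy. The cosine-similarity softmax weighting is a generic aggregation scheme chosen for illustration, not the exact SSVD formulation (which additionally estimates motion paths and samples features in a two-stream structure).

```python
import numpy as np

def aggregate_features(current, neighbors):
    """Aggregate a frame's feature map (C, H, W) with its neighbors'
    feature maps using per-location cosine-similarity softmax weights.
    A generic temporal-aggregation sketch, not SSVD's exact method."""
    frames = [current] + list(neighbors)

    def cos_sim(a, b):
        # Cosine similarity per spatial location, over the channel axis.
        num = np.sum(a * b, axis=0)
        den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-8
        return num / den

    sims = np.stack([cos_sim(f, current) for f in frames])      # (T, H, W)
    weights = np.exp(sims) / np.exp(sims).sum(axis=0, keepdims=True)
    stacked = np.stack(frames)                                  # (T, C, H, W)
    # Weighted average over the temporal axis.
    return np.sum(weights[:, None] * stacked, axis=0)           # (C, H, W)
```

Because the weights sum to one at every spatial location, aggregating a frame with identical neighbors returns the frame unchanged; dissimilar neighbors are down-weighted.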
Related papers
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must cope with high across-frame variation in object appearance and diverse deterioration in some frames.
Most contemporary aggregation methods are tailored for two-stage detectors and suffer from high computational costs.
This study introduces a simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z) - Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain quality spatio-temporal information.
We present a neat and unified framework called the Spatio-temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video belonging to the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - YOLOV: Making Still Image Object Detectors Great at Video Object
Detection [23.039968987772543]
Video object detection (VID) is challenging because of the high variation of object appearance and the diverse deterioration in some frames.
This work proposes a simple yet effective strategy to address the concerns, which spends marginal overheads with significant gains in accuracy.
Our YOLOX-based model can achieve promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU).
arXiv Detail & Related papers (2022-08-20T14:12:06Z) - Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in
VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on frequency domain find that the GAN forged images have obvious grid-like visual artifacts in the frequency spectrum compared to the real images.
This paper proposes a Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
arXiv Detail & Related papers (2022-07-05T09:27:53Z) - Real-Time and Accurate Object Detection in Compressed Video by Long
Short-term Feature Aggregation [30.73836337432833]
Video object detection is studied for pushing the limits of detection speed and accuracy.
To reduce the cost, we sparsely sample key frames in the video and treat the remaining frames as non-key frames.
A large and deep network is used to extract features for key frames and a tiny network is used for non-key frames.
The proposed video object detection network is evaluated on the large-scale ImageNet VID benchmark.
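The key-frame scheme above pairs a large feature extractor for sparsely sampled key frames with a tiny one for the rest. A minimal sketch of that dispatch logic, assuming a fixed sampling stride (the stride value and the stand-in "networks" are illustrative assumptions, not details from the paper):

```python
def split_key_frames(num_frames, stride=10):
    """Sparsely sample key frames at a fixed stride; all other frames
    are treated as non-key frames. The stride is an assumed value."""
    keys = list(range(0, num_frames, stride))
    non_keys = [i for i in range(num_frames) if i % stride != 0]
    return keys, non_keys

def extract_features(frame_idx, keys):
    """Dispatch: a large, deep network for key frames, a tiny network
    otherwise. Both networks are string stand-ins in this sketch."""
    net = "large_net" if frame_idx in keys else "tiny_net"
    return (net, frame_idx)
```

With a stride of 10, only one frame in ten pays the cost of the deep network; features for the cheap frames are then typically refined by aggregating from the nearest key frames.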
arXiv Detail & Related papers (2021-03-25T01:38:31Z) - CompFeat: Comprehensive Feature Aggregation for Video Instance
Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z) - RN-VID: A Feature Fusion Architecture for Video Object Detection [10.667492516216889]
We propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to video object detection.
First, we propose a new architecture that allows the usage of information from nearby frames to enhance feature maps.
Second, we propose a novel module to merge feature maps of the same dimensions using re-ordering of channels and 1×1 convolutions.
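The channel re-ordering plus 1×1 convolution fusion described above can be sketched as follows. This is an illustrative NumPy reconstruction of the general mechanism, with random weights standing in for the learned 1×1 convolution parameters; it is not the authors' implementation.

```python
import numpy as np

def merge_feature_maps(maps, seed=0):
    """Merge same-sized feature maps, each shaped (C, H, W), by
    interleaving their channels and applying a 1x1 convolution.
    The convolution weights are random stand-ins for learned ones."""
    t, (c, h, w) = len(maps), maps[0].shape
    # Channel re-ordering: stack on a new axis after channels, so the
    # same channel from every map ends up adjacent after reshaping.
    interleaved = np.stack(maps, axis=1).reshape(t * c, h, w)
    # A 1x1 convolution is a per-pixel linear map over channels,
    # reducing t*c input channels back to c output channels.
    rng = np.random.default_rng(seed)
    w1x1 = rng.standard_normal((c, t * c))
    return np.einsum('oc,chw->ohw', w1x1, interleaved)  # (C, H, W)
```

Interleaving puts corresponding channels from each frame next to each other, so the 1×1 convolution can weigh the same semantic channel across frames when producing each fused output channel.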
arXiv Detail & Related papers (2020-03-24T14:54:46Z) - Plug & Play Convolutional Regression Tracker for Video Object Detection [37.47222104272429]
Video object detection targets to simultaneously localize the bounding boxes of the objects and identify their classes in a given video.
One challenge for video object detection is to consistently detect all objects across the whole video.
We propose a Plug & Play scale-adaptive convolutional regression tracker for the video object detection task.
arXiv Detail & Related papers (2020-03-02T15:57:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.