Spatio-Temporal Learnable Proposals for End-to-End Video Object
Detection
- URL: http://arxiv.org/abs/2210.02368v2
- Date: Fri, 7 Oct 2022 14:07:38 GMT
- Title: Spatio-Temporal Learnable Proposals for End-to-End Video Object
Detection
- Authors: Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal
- Abstract summary: We present SparseVOD, a novel video object detection pipeline that employs Sparse R-CNN to exploit temporal information.
Our method significantly improves the single-frame Sparse R-CNN by 8%-9% in mAP.
- Score: 12.650574326251023
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents the novel idea of generating object proposals by
leveraging temporal information for video object detection. The feature
aggregation in modern region-based video object detectors heavily relies on
learned proposals generated from a single-frame RPN. This inevitably introduces
additional components like NMS and produces unreliable proposals on low-quality
frames. To tackle these restrictions, we present SparseVOD, a novel video
object detection pipeline that employs Sparse R-CNN to exploit temporal
information. In particular, we introduce two modules in the dynamic head of
Sparse R-CNN. First, the Temporal Feature Extraction module based on the
Temporal RoI Align operation is added to extract the RoI proposal features.
Second, motivated by sequence-level semantic aggregation, we incorporate the
attention-guided Semantic Proposal Feature Aggregation module to enhance object
feature representation before detection. The proposed SparseVOD effectively
alleviates the overhead of complicated post-processing methods and makes the
overall pipeline end-to-end trainable. Extensive experiments show that our
method significantly improves the single-frame Sparse R-CNN by 8%-9% in mAP.
Furthermore, besides achieving state-of-the-art 80.3% mAP on the ImageNet VID
dataset with ResNet-50 backbone, our SparseVOD outperforms existing
proposal-based methods by a significant margin at higher IoU thresholds
(IoU > 0.5).
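To make the aggregation step concrete, the following is a minimal PyTorch-style sketch of the attention-guided proposal feature aggregation described in the abstract. It assumes the Temporal RoI Align step has already produced a stack of per-frame RoI proposal features; the tensor shapes, the linear embedding, and the cosine-similarity weighting are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch: re-weight per-frame proposal features by semantic similarity
# to the current (key) frame, then sum over time before detection.
# All shapes and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticProposalFeatureAggregation(nn.Module):
    """Attention-guided aggregation of per-frame RoI proposal features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(dim, dim)  # projection before computing similarity

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (T, N, C) -- T frames, N proposals, C channels,
        # with index 0 assumed to be the current (key) frame.
        key = self.embed(roi_feats[0:1])      # (1, N, C)
        support = self.embed(roi_feats)       # (T, N, C)
        # Cosine-similarity attention weights over the temporal axis.
        sim = F.cosine_similarity(key, support, dim=-1)   # (T, N)
        weights = sim.softmax(dim=0).unsqueeze(-1)        # (T, N, 1)
        # Weighted sum over frames -> enhanced proposal features (N, C).
        return (weights * roi_feats).sum(dim=0)


if __name__ == "__main__":
    T, N, C = 5, 100, 256                  # frames, proposals, channels (assumed)
    roi_feats = torch.randn(T, N, C)       # stand-in for Temporal RoI Align output
    aggregated = SemanticProposalFeatureAggregation(C)(roi_feats)
    print(aggregated.shape)                # torch.Size([100, 256])
```
In this sketch the current frame's proposals act as queries and the support-frame features are re-weighted by semantic similarity before being summed, mirroring the sequence-level semantic aggregation idea the abstract refers to.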
Related papers
- LR-FPN: Enhancing Remote Sensing Object Detection with Location Refined Feature Pyramid Network [2.028685490378346]
We propose a novel location refined feature pyramid network (LR-FPN) to enhance the extraction of shallow positional information.
Experiments on two large-scale remote sensing datasets demonstrate that the proposed LR-FPN is superior to state-of-the-art object detection approaches.
arXiv Detail & Related papers (2024-04-02T03:36:07Z)
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- A Generic Object Re-identification System for Short Videos [39.662850217144964]
A Temporal Information Fusion Network (TIFN) is proposed in the object detection module.
A Cross-Layer Pointwise Siamese Network (CPSN) is proposed in the tracking module to enhance the robustness of the appearance model.
Two challenge datasets containing real-world short videos are built for video object trajectory extraction and generic object re-identification.
arXiv Detail & Related papers (2021-02-10T05:45:09Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance over state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on single-frame detection, ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z)
- Spatio-temporal Tubelet Feature Aggregation and Object Linking in Videos [2.4923006485141284]
The paper addresses the problem of how to exploit temporal information in available videos to improve object classification.
We propose a two-stage object detector called FANet based on short-term feature aggregation.
arXiv Detail & Related papers (2020-04-01T13:52:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.