RN-VID: A Feature Fusion Architecture for Video Object Detection
- URL: http://arxiv.org/abs/2003.10898v2
- Date: Thu, 2 Apr 2020 15:53:28 GMT
- Title: RN-VID: A Feature Fusion Architecture for Video Object Detection
- Authors: Hughes Perreault, Maguelonne Héritier, Pierre Gravel, Guillaume-Alexandre Bilodeau and Nicolas Saunier
- Abstract summary: We propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to video object detection.
First, we propose a new architecture that allows the usage of information from nearby frames to enhance feature maps.
Second, we propose a novel module to merge feature maps of the same dimensions using re-ordering of channels and 1 x 1 convolutions.
- Score: 10.667492516216889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consecutive frames in a video are highly redundant. Therefore, to perform the
task of video object detection, executing single frame detectors on every frame
without reusing any information is quite wasteful. It is with this idea in mind
that we propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to
video object detection. Our contributions are twofold. First, we propose a new
architecture that allows the usage of information from nearby frames to enhance
feature maps. Second, we propose a novel module to merge feature maps of the
same dimensions using re-ordering of channels and 1 x 1 convolutions. We then
demonstrate that RN-VID achieves better mean average precision (mAP) than
corresponding single frame detectors with little additional cost during
inference.
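The fusion module described in the abstract (channel re-ordering followed by 1 x 1 convolutions over same-sized feature maps) can be pictured with a short sketch. The code below is an illustration under assumptions, not the authors' implementation: the module name, the interleaving order, and the use of a single 1 x 1 convolution are choices made here for clarity.

```python
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    """Merge same-sized feature maps from nearby frames.

    Sketch of the idea in the abstract: re-order (interleave) channels so
    that corresponding channels from each frame sit next to each other,
    then let a 1x1 convolution mix them back down to a single map.
    """

    def __init__(self, channels: int, n_frames: int):
        super().__init__()
        self.n_frames = n_frames
        # 1x1 convolution that reduces n_frames * channels -> channels.
        self.mix = nn.Conv2d(n_frames * channels, channels, kernel_size=1)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: n_frames tensors, each (B, C, H, W) with identical shapes.
        b, c, h, w = feats[0].shape
        stacked = torch.stack(feats, dim=2)           # (B, C, n_frames, H, W)
        # Interleave: channel order becomes f0_c0, f1_c0, f2_c0, f0_c1, ...
        interleaved = stacked.reshape(b, c * self.n_frames, h, w)
        return self.mix(interleaved)

# Usage: fuse feature maps of the current frame and two neighbours.
fusion = ChannelFusion(channels=256, n_frames=3)
maps = [torch.randn(1, 256, 32, 32) for _ in range(3)]
fused = fusion(maps)  # (1, 256, 32, 32)
```

Because of the interleaving, the 1 x 1 convolution sees every frame's version of each channel at every spatial location, which is what makes this a cheap way to merge temporal information.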
Related papers
- STF: Spatio-Temporal Fusion Module for Improving Video Object Detection [7.213855322671065]
Consecutive frames in a video contain redundancy, but they may also contain complementary information for the detection task.
We propose a spatio-temporal fusion framework (STF) to leverage this complementary information.
The proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors.
arXiv Detail & Related papers (2024-02-16T15:19:39Z) - YOLOV: Making Still Image Object Detectors Great at Video Object Detection [23.039968987772543]
Video object detection (VID) is challenging because of the high variation of object appearance and the diverse deterioration in some frames.
This work proposes a simple yet effective strategy to address these concerns, adding marginal overhead for significant gains in accuracy.
Our YOLOX-based model can achieve promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU).
arXiv Detail & Related papers (2022-08-20T14:12:06Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
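As a rough illustration only: the two-branch idea (separate motion-guided and appearance-guided reasoning over a query, combined for grounding) can be sketched as below. This is a generic stand-in, not the MARN architecture; all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class TwoBranchGrounder(nn.Module):
    """Illustrative two-branch design: one branch reasons over appearance
    features, the other over motion features, and their query-conditioned
    per-clip scores are combined by a simple late fusion."""

    def __init__(self, feat_dim: int, query_dim: int, hidden: int = 256):
        super().__init__()
        self.app_proj = nn.Linear(feat_dim, hidden)
        self.mot_proj = nn.Linear(feat_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        self.app_score = nn.Linear(hidden, 1)
        self.mot_score = nn.Linear(hidden, 1)

    def forward(self, app, mot, query):
        # app, mot: (B, T, feat_dim) clip features; query: (B, query_dim).
        q = self.query_proj(query).unsqueeze(1)      # (B, 1, hidden)
        a = torch.tanh(self.app_proj(app) + q)       # appearance-guided
        m = torch.tanh(self.mot_proj(mot) + q)       # motion-guided
        # Per-clip relevance from each branch, combined additively.
        return self.app_score(a).squeeze(-1) + self.mot_score(m).squeeze(-1)
```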
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - FFAVOD: Feature Fusion Architecture for Video Object Detection [11.365829102707014]
We propose FFAVOD, standing for feature fusion architecture for video object detection.
We first introduce a novel video object detection architecture that allows a network to share feature maps between nearby frames.
We show that using the proposed architecture and the fusion module can improve the performance of three base object detectors on two object detection benchmarks containing sequences of moving road users.
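The sharing of feature maps between nearby frames can be sketched as a cache around the backbone, as below. This is a minimal illustration under assumptions (window size, cache policy), not the FFAVOD implementation; a real version would pad at sequence boundaries so the fused window keeps a fixed length.

```python
import collections
import torch

class SharedFeatureWindow:
    """Cache backbone feature maps so each frame is computed once and
    reused by its neighbours, then hand a window of maps to a fusion
    module before the detection head."""

    def __init__(self, backbone, fusion, window: int = 2):
        self.backbone = backbone      # frame -> feature map
        self.fusion = fusion          # list of maps -> fused map
        self.cache = collections.deque(maxlen=2 * window + 1)

    @torch.no_grad()
    def step(self, frame: torch.Tensor) -> torch.Tensor:
        self.cache.append(self.backbone(frame))
        # Fuse whatever neighbours are available so far.
        return self.fusion(list(self.cache))
```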
arXiv Detail & Related papers (2021-09-15T13:53:21Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
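The simultaneous (full-duplex) cross-modal passing can be pictured as a symmetric exchange between an appearance (RGB) stream and a motion (optical flow) stream, as in the sketch below. The gating used here is a generic stand-in, not FSNet's actual relational module.

```python
import torch
import torch.nn as nn

class MutualPass(nn.Module):
    """Symmetric cross-modal exchange: each stream both transmits its
    features to the other and receives features back in one step,
    before any fusion/decoding."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate_flow = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb, flow):
        # Both updates read the *original* inputs, so the passing is
        # simultaneous (full-duplex) rather than one stream at a time.
        rgb_out = rgb + torch.sigmoid(self.gate_flow(flow)) * flow
        flow_out = flow + torch.sigmoid(self.gate_rgb(rgb)) * rgb
        return rgb_out, flow_out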
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - Multiview Detection with Feature Perspective Transformation [59.34619548026885]
We propose a novel multiview detection system, MVDet.
We take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane.
Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset.
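The core projection step (warping per-camera feature maps onto a common ground plane before aggregation) can be sketched with a homography and grid sampling. The homography is assumed given per camera; this is an illustration of the operation, not the MVDet code.

```python
import torch
import torch.nn.functional as F

def project_to_ground(feat, H, out_hw):
    """Warp one camera's feature map (1, C, Hf, Wf) onto a ground-plane
    grid of size out_hw, given a 3x3 homography H mapping ground-plane
    cell coordinates (x, y, 1) to feature-map pixel coordinates."""
    gh, gw = out_hw
    ys, xs = torch.meshgrid(
        torch.arange(gh, dtype=torch.float32),
        torch.arange(gw, dtype=torch.float32),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    ground = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)  # (gh*gw, 3)
    pix = ground @ H.T
    # Points behind the camera would need masking in a real implementation.
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    _, _, hf, wf = feat.shape
    norm = torch.empty_like(pix)
    norm[:, 0] = pix[:, 0] / (wf - 1) * 2 - 1
    norm[:, 1] = pix[:, 1] / (hf - 1) * 2 - 1
    grid = norm.reshape(1, gh, gw, 2)
    return F.grid_sample(feat, grid, align_corners=True)

# Per-camera maps warped this way can be concatenated on the ground plane
# and passed to convolutions for anchor-free occupancy prediction.
```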
arXiv Detail & Related papers (2020-07-14T17:58:30Z) - Single Shot Video Object Detector [215.06904478667337]
Single Shot Video Object Detector (SSVD) is a new architecture that integrates feature aggregation into a one-stage detector for object detection in videos.
For $448 \times 448$ input, SSVD achieves 79.2% mAP on the ImageNet VID dataset.
arXiv Detail & Related papers (2020-07-07T15:36:26Z) - Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
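The graph construction can be pictured as attentive message passing over frame embeddings on a fully connected graph, as in the sketch below. The attention form and the GRU update used here are assumptions, not AGNN's exact formulation.

```python
import torch
import torch.nn as nn

class FrameGraphLayer(nn.Module):
    """One round of message passing on a fully connected frame graph:
    every node (frame embedding) attends to every other node and
    aggregates their messages into an updated node state."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes):                   # nodes: (N, dim), N frames
        att = self.q(nodes) @ self.k(nodes).T   # (N, N) pairwise edges
        att = torch.softmax(att / nodes.shape[-1] ** 0.5, dim=-1)
        msgs = att @ self.v(nodes)              # aggregated messages
        return self.update(msgs, nodes)         # recurrent node update

# Several such rounds would be run before reading out per-frame masks.
```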
arXiv Detail & Related papers (2020-01-19T10:45:27Z) - Pack and Detect: Fast Object Detection in Videos Using Region-of-Interest Packing [15.162117090697006]
We propose Pack and Detect, an approach to reduce the computational requirements of object detection in videos.
Experiments using the ImageNet video object detection dataset indicate that PaD can potentially reduce the number of FLOPs required for a frame by $4\times$.
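The packing idea (cropping regions of interest and detecting on a small packed canvas instead of the full frame) can be sketched as below; the single-row layout and canvas size are toy choices, not the paper's packing policy.

```python
import numpy as np

def pack_rois(frame, boxes, canvas_hw=(224, 224)):
    """Toy ROI packing: crop each box from the frame and place the crops
    side by side on a small canvas, so one cheap detector pass covers
    all regions. Returns the canvas and each crop's placement so that
    detections can be mapped back to full-frame coordinates."""
    ch, cw = canvas_hw
    canvas = np.zeros((ch, cw, 3), dtype=frame.dtype)
    placements, x = [], 0
    for (x0, y0, x1, y1) in boxes:
        crop = frame[y0:y1, x0:x1]
        h, w = crop.shape[:2]
        if x + w > cw or h > ch:      # naive layout: single row, no resize
            break
        canvas[:h, x:x + w] = crop
        placements.append(((x0, y0), (x, 0), (w, h)))
        x += w
    return canvas, placements
```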
arXiv Detail & Related papers (2018-09-05T19:29:34Z)