Predict to Detect: Prediction-guided 3D Object Detection using
Sequential Images
- URL: http://arxiv.org/abs/2306.08528v3
- Date: Tue, 5 Sep 2023 05:35:31 GMT
- Title: Predict to Detect: Prediction-guided 3D Object Detection using
Sequential Images
- Authors: Sanmin Kim, Youngseok Kim, In-Jae Lee, Dongsuk Kum
- Abstract summary: We propose a novel 3D object detection model, P2D (Predict to Detect), that integrates a prediction scheme into a detection framework.
P2D predicts object information in the current frame using solely past frames to learn temporal motion features.
We then introduce a novel temporal feature aggregation method that attentively exploits Bird's-Eye-View (BEV) features based on predicted object information.
- Score: 15.51093009875854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent camera-based 3D object detection methods have introduced sequential
frames to improve detection performance, hoping that multiple frames would
mitigate the large depth estimation error. Despite improved detection
performance, prior works rely on naive fusion methods (e.g., concatenation) or
are limited to static scenes (e.g., temporal stereo), neglecting the importance
of object motion cues. These approaches do not fully exploit the
potential of sequential images and show limited performance improvements. To
address this limitation, we propose a novel 3D object detection model, P2D
(Predict to Detect), that integrates a prediction scheme into a detection
framework to explicitly extract and leverage motion features. P2D predicts
object information in the current frame using solely past frames to learn
temporal motion features. We then introduce a novel temporal feature
aggregation method that attentively exploits Bird's-Eye-View (BEV) features
based on predicted object information, resulting in accurate 3D object
detection. Experimental results demonstrate that P2D improves mAP and NDS by
3.0% and 3.7%, respectively, compared to the sequential image-based baseline, illustrating
that incorporating a prediction scheme can significantly improve detection
accuracy.
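The P2D implementation is not shown on this page; the following is a minimal, illustrative sketch of the predict-then-detect idea the abstract describes, assuming a GRU predictor and standard multi-head attention (module choices and tensor shapes are assumptions, not the authors' architecture):
```python
import torch
import torch.nn as nn

class PredictionGuidedAggregation(nn.Module):
    """Toy stand-in for P2D's prediction-guided temporal aggregation."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Predict coarse current-frame features from past frames only.
        self.predictor = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        # Predicted features act as queries; BEV features as keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, past_bev, curr_bev):
        # past_bev: (B, T, N, C) flattened BEV features from T past frames
        # curr_bev: (B, N, C)    flattened BEV features of the current frame
        B, T, N, C = past_bev.shape
        # Predict: summarize each BEV cell's history to anticipate object
        # information in the current frame without looking at it.
        hist = past_bev.permute(0, 2, 1, 3).reshape(B * N, T, C)
        _, h = self.predictor(hist)                    # h: (1, B*N, C)
        pred_query = h.squeeze(0).reshape(B, N, C)
        # Detect: attend over current-frame BEV features with the predicted
        # query, so aggregation focuses on motion-consistent locations.
        fused, _ = self.attn(pred_query, curr_bev, curr_bev)
        return fused                                   # (B, N, C) -> detection head

# Toy usage with random tensors standing in for real BEV features.
agg = PredictionGuidedAggregation()
past = torch.randn(2, 3, 64, 256)   # 2 scenes, 3 past frames, 64 BEV cells
curr = torch.randn(2, 64, 256)
print(agg(past, curr).shape)        # torch.Size([2, 64, 256])
```
The point of the design is that the attention query encodes anticipated object motion, so aggregation can follow moving objects instead of naively concatenating frames.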
Related papers
- Uncertainty Estimation for 3D Object Detection via Evidential Learning [63.61283174146648]
We introduce a framework for quantifying uncertainty in 3D object detection by leveraging an evidential learning loss on Bird's Eye View representations in the 3D detector.
We demonstrate both the efficacy and importance of these uncertainty estimates for identifying out-of-distribution scenes, poorly localized objects, and missing (false negative) detections.
arXiv Detail & Related papers (2024-10-31T13:13:32Z)
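The summary names an evidential learning loss on BEV representations but not its exact form; below is a minimal sketch of a generic evidential classification loss in the style of Sensoy et al. (2018), where the softplus evidence head and the type-II maximum-likelihood term are assumptions rather than this paper's formulation:
```python
import torch
import torch.nn.functional as F

def evidential_loss(logits, targets):
    """logits: (N, K) raw scores, e.g. per BEV cell; targets: (N,) class ids."""
    evidence = F.softplus(logits)            # non-negative evidence per class
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    # Type-II maximum likelihood under the Dirichlet:
    # loss = sum_k y_k * (log S - log alpha_k)
    y = F.one_hot(targets, logits.shape[-1]).float()
    nll = (y * (strength.log() - alpha.log())).sum(dim=-1)
    # Predictive uncertainty K / S approaches 1 when total evidence is low,
    # which is what flags out-of-distribution scenes and missed detections.
    uncertainty = logits.shape[-1] / strength.squeeze(-1)
    return nll.mean(), uncertainty

loss, u = evidential_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```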
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- MDS-Net: A Multi-scale Depth Stratification Based Monocular 3D Object Detection Algorithm [4.958840734249869]
This paper proposes a one-stage monocular 3D object detection algorithm based on multi-scale depth stratification.
Experiments on the KITTI benchmark show that the MDS-Net outperforms the existing monocular 3D detection methods in 3D detection and BEV detection tasks.
arXiv Detail & Related papers (2022-01-12T07:11:18Z)
- DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [43.02373021724797]
We introduce a framework for multi-camera 3D object detection.
Our method manipulates predictions directly in 3D space.
We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
arXiv Detail & Related papers (2021-10-13T17:59:35Z)
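As a rough illustration of the 3D-to-2D query mechanism summarized above, the sketch below projects per-query 3D reference points into each camera and bilinearly samples image features; the simple mean fusion across cameras and all shapes are simplifying assumptions, not DETR3D's exact implementation:
```python
import torch
import torch.nn.functional as F

def sample_features_from_3d_points(ref_points, feats, projections, img_hw):
    # ref_points:  (B, Q, 3)          3D reference points, one per object query
    # feats:       (B, Ncam, C, H, W) per-camera image feature maps
    # projections: (B, Ncam, 3, 4)    camera matrices (world -> pixel coordinates)
    B, Q, _ = ref_points.shape
    Ncam = feats.shape[1]
    homo = torch.cat([ref_points, ref_points.new_ones(B, Q, 1)], dim=-1)  # (B,Q,4)
    # Project every query point into every camera.
    pix = torch.einsum('bnij,bqj->bnqi', projections, homo)   # (B,Ncam,Q,3)
    depth = pix[..., 2:3].clamp(min=1e-5)
    uv = pix[..., :2] / depth                                 # pixel coordinates
    # Normalize to [-1, 1] for grid_sample.
    H, W = img_hw
    grid = torch.stack([uv[..., 0] / W * 2 - 1, uv[..., 1] / H * 2 - 1], dim=-1)
    sampled = F.grid_sample(
        feats.flatten(0, 1),                    # (B*Ncam, C, H, W)
        grid.flatten(0, 1).unsqueeze(1),        # (B*Ncam, 1, Q, 2)
        align_corners=False,
    )                                           # (B*Ncam, C, 1, Q)
    sampled = sampled.view(B, Ncam, -1, Q)
    # Mask points that fall behind a camera (clamped depth stays at 1e-5);
    # DETR3D also masks points that project outside the image.
    valid = (depth.squeeze(-1) > 1e-5).view(B, Ncam, 1, Q).float()
    return (sampled * valid).sum(1) / valid.sum(1).clamp(min=1)  # (B, C, Q)
```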
- Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting [12.611269919468999]
We present a novel neighbor-voting method that incorporates neighbor predictions to ameliorate object detection from severely deformed pseudo-LiDAR point clouds.
Our results on bird's-eye-view detection outperform the state of the art by a large margin, especially for the "hard" difficulty level.
arXiv Detail & Related papers (2021-07-06T09:18:33Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy that receives a reward only after several steps, so we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
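The refinement loop this entry describes can be sketched as follows; the random placeholder policy, the step sizes, and the surrogate overlap score are illustrative assumptions (the paper learns the policy with reinforcement learning against an IoU-based reward):
```python
import random

PARAMS = ['x', 'y', 'z', 'w', 'h', 'l', 'yaw']  # the 7 box parameters
STEP = {'x': 0.1, 'y': 0.1, 'z': 0.1, 'w': 0.05, 'h': 0.05, 'l': 0.05, 'yaw': 0.05}

def overlap_score(box, gt):
    # Placeholder for 3D IoU: negative summed parameter error, higher is better.
    return -sum(abs(box[k] - gt[k]) for k in PARAMS)

def refine(box, gt, policy, num_steps=20):
    """Refine `box` toward `gt`, changing exactly one parameter per step."""
    start = overlap_score(box, gt)
    for _ in range(num_steps):
        param, direction = policy(box)          # action: (which parameter, +1/-1)
        box[param] += direction * STEP[param]   # axial move: one parameter only
    # Delayed reward, available only after several steps, hence RL training.
    reward = overlap_score(box, gt) - start
    return box, reward

random_policy = lambda box: (random.choice(PARAMS), random.choice([-1, 1]))
init = dict(x=1.0, y=0.5, z=0.0, w=1.6, h=1.5, l=4.0, yaw=0.0)
gt = dict(x=1.3, y=0.4, z=0.1, w=1.7, h=1.4, l=4.2, yaw=0.1)
refined, reward = refine(dict(init), gt, random_policy)
```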
- SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [3.1542695050861544]
Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving.
We propose a novel 3D object detection method, named SMOKE, that combines a single keypoint estimate with regressed 3D variables.
Despite its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset.
arXiv Detail & Related papers (2020-02-24T08:15:36Z)
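To make the keypoint-plus-regression idea concrete, here is a minimal sketch that decodes a 3D object center from a heatmap peak and a regressed depth map via the inverse camera intrinsics; SMOKE additionally regresses dimensions and orientation, which this sketch omits, and all shapes here are assumptions:
```python
import torch

def decode_center(heatmap, depth_map, K_inv):
    # heatmap:   (H, W) keypoint scores for one class
    # depth_map: (H, W) regressed depth at each pixel
    # K_inv:     (3, 3) inverse camera intrinsics
    idx = heatmap.flatten().argmax().item()
    v, u = divmod(idx, heatmap.shape[1])     # heatmap peak = projected 3D center
    z = depth_map[v, u]
    # Back-project the keypoint: X = z * K^-1 @ [u, v, 1]^T
    pix = torch.tensor([float(u), float(v), 1.0])
    return z * (K_inv @ pix)                 # 3D center in camera coordinates

K = torch.tensor([[720.0, 0.0, 320.0], [0.0, 720.0, 180.0], [0.0, 0.0, 1.0]])
center = decode_center(torch.rand(360, 640), torch.full((360, 640), 10.0),
                       torch.linalg.inv(K))
print(center)  # a point roughly 10 m in front of the camera
```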
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.