Learning to Predict the 3D Layout of a Scene
- URL: http://arxiv.org/abs/2011.09977v1
- Date: Thu, 19 Nov 2020 17:23:30 GMT
- Title: Learning to Predict the 3D Layout of a Scene
- Authors: Jihao Andreas Lin, Jakob Brünker, Daniel Fährmann
- Abstract summary: We propose a method that only uses a single RGB image, thus enabling applications in devices or vehicles that do not have LiDAR sensors.
We use the KITTI dataset for training, which consists of street traffic scenes with class labels, 2D bounding boxes and 3D annotations with seven degrees of freedom.
We achieve a mean average precision of 47.3% on moderately difficult data, measured at a 3D intersection-over-union threshold of 70% as required by the official KITTI benchmark, outperforming previous state-of-the-art single-RGB-only methods by a large margin.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While 2D object detection has improved significantly in recent
years, real-world applications of computer vision often require an
understanding of the 3D
layout of a scene. Many recent approaches to 3D detection use LiDAR point
clouds for prediction. We propose a method that only uses a single RGB image,
thus enabling applications in devices or vehicles that do not have LiDAR
sensors. Using an RGB image allows us to leverage the maturity and success of
recent 2D object detectors by extending a 2D detector with a 3D detection
head. In this paper we discuss different approaches and experiments, including
both regression and classification methods, for designing this 3D detection
head. Furthermore, we evaluate how subproblems and implementation details
impact the overall prediction result. We use the KITTI dataset for training,
which consists of street traffic scenes with class labels, 2D bounding boxes
and 3D annotations with seven degrees of freedom. Our final architecture is
based on Faster R-CNN. The outputs of the convolutional backbone are
fixed-size feature maps for every region of interest. Fully connected layers
within
the network head then propose an object class and perform 2D bounding box
regression. We extend the network head with a 3D detection head, which
predicts every degree of freedom of a 3D bounding box via classification. We
achieve a mean average precision of 47.3% on moderately difficult data,
measured at a 3D intersection-over-union threshold of 70% as required by the
official KITTI benchmark, outperforming previous state-of-the-art
single-RGB-only methods by a large margin.
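To make the binned-classification idea in the abstract concrete, here is a minimal sketch of how such a 3D detection head could sit on top of pooled RoI features. This is not the authors' code: the class name Binned3DHead, the layer sizes, the bin count, and the uniform value ranges are all illustrative assumptions.

```python
# A minimal sketch (assumed, not the authors' code) of a 3D detection head
# that classifies each of the seven degrees of freedom into discretized bins,
# operating on pooled per-RoI feature vectors such as those in Faster R-CNN.
import torch
import torch.nn as nn

class Binned3DHead(nn.Module):
    """Predicts 7 DoF (x, y, z, width, height, length, yaw) via bin classification."""

    def __init__(self, in_features: int = 1024, num_bins: int = 32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_features, 512), nn.ReLU())
        # One independent classifier per degree of freedom.
        self.dof_heads = nn.ModuleList([nn.Linear(512, num_bins) for _ in range(7)])

    def forward(self, roi_features: torch.Tensor) -> torch.Tensor:
        # roi_features: (num_rois, in_features) pooled RoI vectors.
        h = self.shared(roi_features)
        # Bin logits of shape (num_rois, 7, num_bins).
        return torch.stack([head(h) for head in self.dof_heads], dim=1)

def decode_bins(logits: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    """Map winning bins back to continuous values over assumed ranges [lo, hi] per DoF."""
    num_bins = logits.shape[-1]
    centers = (logits.argmax(dim=-1).float() + 0.5) / num_bins  # bin centers in (0, 1)
    return lo + centers * (hi - lo)  # (num_rois, 7) continuous estimates
```

Training such a head would typically apply a cross-entropy loss per degree of freedom against the bin containing the ground-truth value; discretizing into bins turns difficult regression targets, most notably orientation, into classification problems that are often easier to optimize.

The reported 47.3% mAP is measured at a 3D IoU threshold of 70%. The official KITTI metric evaluates rotated 3D boxes; the axis-aligned simplification below only illustrates the underlying volume-overlap computation.

```python
def axis_aligned_3d_iou(a, b):
    """3D IoU for boxes given as (xmin, ymin, zmin, xmax, ymax, zmax).

    Simplified illustration: the official KITTI benchmark evaluates
    rotated 3D boxes, not axis-aligned ones.
    """
    inter = 1.0
    for i in range(3):  # overlap extent along each axis
        inter *= max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i]))
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0
```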
Related papers
- Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data [57.53523870705433]
We propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det.
OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes.
It employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors.
arXiv Detail & Related papers (2024-11-23T21:37:21Z)
- Recursive Cross-View: Use Only 2D Detectors to Achieve 3D Object Detection without 3D Annotations [0.5439020425819]
We propose a method that does not demand any 3D annotations, while being able to predict fully oriented 3D bounding boxes.
Our method, called Recursive Cross-View (RCV), utilizes the three-view principle to convert 3D detection into multiple 2D detection tasks.
RCV is the first 3D detection method that yields fully oriented 3D boxes without consuming 3D labels.
arXiv Detail & Related papers (2022-11-14T04:51:05Z)
- FGR: Frustum-Aware Geometric Reasoning for Weakly Supervised 3D Vehicle Detection [81.79171905308827]
We propose frustum-aware geometric reasoning (FGR) to detect vehicles in point clouds without any 3D annotations.
Our method consists of two stages: coarse 3D segmentation and 3D bounding box estimation.
It is able to accurately detect objects in 3D space with only 2D bounding boxes and sparse point clouds.
arXiv Detail & Related papers (2021-05-17T07:29:55Z)
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to adapt a general 2D detector to this 3D task.
In this technical report, we study this problem with a practical approach built on a fully convolutional single-stage detector.
Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection [69.68263074432224]
We present a novel framework named ZoomNet for stereo imagery-based 3D detection.
The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes.
To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straightforward module -- adaptive zooming.
arXiv Detail & Related papers (2020-03-01T17:18:08Z)
- SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [3.1542695050861544]
Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving.
We propose a novel 3D object detection method, named SMOKE, that combines a single keypoint estimate with regressed 3D variables.
Despite its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset.
arXiv Detail & Related papers (2020-02-24T08:15:36Z)
- DSGN: Deep Stereo Geometry Network for 3D Object Detection [79.16397166985706]
There is a large performance gap between image-based and LiDAR-based 3D object detectors.
Our method, called Deep Stereo Geometry Network (DSGN), significantly reduces this gap.
For the first time, we provide a simple and effective one-stage stereo-based 3D detection pipeline.
arXiv Detail & Related papers (2020-01-10T11:44:37Z)