CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based
3D Object Detection
- URL: http://arxiv.org/abs/2006.04080v2
- Date: Tue, 26 Jan 2021 16:29:45 GMT
- Title: CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based
3D Object Detection
- Authors: Shubham Shrivastava and Punarjay Chakravarty
- Abstract summary: We introduce a method for 3D object detection using a single monocular image.
We show that we can pre-train the AE using paired RGB and depth images from simulation data once and subsequently only train the 3DOD network using real data.
- Our 3DOD network utilizes a particular `cubification' of 3D space around the camera, where each cuboid is tasked with predicting N object poses, along with their class and confidence values.
- Score: 8.134961550216618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a method for 3D object detection using a single monocular image.
Starting from a synthetic dataset, we pre-train an RGB-to-Depth Auto-Encoder
(AE). The embedding learnt by this AE is then used to train a 3D Object
Detector (3DOD) CNN, which regresses the parameters of 3D object poses from
the latent embedding that the AE's encoder generates from the RGB image.
We show that we can pre-train the AE using paired RGB and depth images from
simulation data once and subsequently only train the 3DOD network using real
data, comprising RGB images and 3D object pose labels (without the
requirement of dense depth). Our 3DOD network utilizes a particular
`cubification' of 3D space around the camera, where each cuboid is tasked with
predicting N object poses, along with their class and confidence values. The AE
pre-training and this method of dividing the 3D space around the camera into
cuboids give our method its name - CubifAE-3D. We demonstrate results for
monocular 3D object detection in the Autonomous Vehicle (AV) use-case with the
Virtual KITTI 2 and the KITTI datasets.
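To make the two-stage pipeline concrete, here is a minimal PyTorch sketch of an RGB-to-Depth auto-encoder and a cubified detection head. All layer sizes, the 4x4x8 cuboid grid, the 7-parameter pose, the class count, and every name (RGBToDepthAE, CubifiedDetectionHead) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RGBToDepthAE(nn.Module):
    """RGB-to-Depth auto-encoder, pre-trained once on synthetic RGB/depth pairs."""
    def __init__(self, latent_ch=128):
        super().__init__()
        self.encoder = nn.Sequential(  # RGB image -> latent embedding
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # latent embedding -> depth map
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        z = self.encoder(rgb)
        return self.decoder(z), z

class CubifiedDetectionHead(nn.Module):
    """Regresses N object poses (plus class scores and a confidence) for each
    cuboid of the cubified camera space. Grid size, pose parametrization, and
    the pooling are illustrative simplifications."""
    def __init__(self, latent_ch=128, grid=(4, 4, 8), n_per_cuboid=2,
                 n_classes=3, pose_dim=7):  # pose: (x, y, z, l, w, h, yaw)
        super().__init__()
        self.grid, self.n = grid, n_per_cuboid
        self.out_dim = pose_dim + n_classes + 1  # + confidence
        n_cuboids = grid[0] * grid[1] * grid[2]
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(latent_ch, 512), nn.ReLU(),
            nn.Linear(512, n_cuboids * n_per_cuboid * self.out_dim),
        )

    def forward(self, z):
        out = self.head(z)
        return out.view(-1, *self.grid, self.n, self.out_dim)

# Stage 1: pre-train the AE on synthetic RGB/depth pairs (done once).
# Stage 2: keep the encoder, train the head on real RGB + 3D pose labels.
ae, head = RGBToDepthAE(), CubifiedDetectionHead()
depth_pred, z = ae(torch.randn(2, 3, 64, 64))
poses = head(z)  # (2, 4, 4, 8, 2, 11): per cuboid, 2 poses + classes + conf
```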
Related papers
- Tracking Objects with 3D Representation from Videos [57.641129788552675]
With 3D object representation learning from pseudo-3D object labels in monocular videos, we propose a new 2D Multiple Object Tracking (MOT) paradigm, called P3DTrack.
arXiv Detail & Related papers (2023-06-08T17:58:45Z)
- OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
- Neural Correspondence Field for Object Pose Estimation [67.96767010122633]
We propose a method for estimating the 6DoF pose of a rigid object, for which a 3D model is available, from a single RGB image.
Unlike classical correspondence-based methods, which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum.
arXiv Detail & Related papers (2022-07-30T01:48:23Z)
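As a rough illustration of the query-point idea, the following NumPy sketch samples 3D points inside a pinhole-camera frustum. The KITTI-like intrinsics and the depth range are assumptions; the paper's actual sampling strategy may differ.

```python
import numpy as np

def sample_frustum_points(n, fx, fy, cx, cy, width, height, z_near, z_far,
                          seed=0):
    """Sample n 3D points inside the camera frustum by drawing a pixel and a
    depth, then unprojecting: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(z_near, z_far, n)
    u = rng.uniform(0.0, width, n)
    v = rng.uniform(0.0, height, n)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (n, 3) points in the camera frame

# KITTI-like intrinsics, assumed purely for illustration
pts = sample_frustum_points(1024, fx=721.5, fy=721.5, cx=609.6, cy=172.9,
                            width=1242, height=375, z_near=0.5, z_far=50.0)
```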
- AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection [15.244852122106634]
We propose an approach for incorporating the shape-aware 2D/3D constraints into the 3D detection framework.
Specifically, we employ a deep neural network to learn distinguished 2D keypoints in the 2D image domain.
To generate the ground truth for the 2D/3D keypoints, an automatic model-fitting approach is proposed.
arXiv Detail & Related papers (2021-08-25T08:50:06Z)
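The 2D/3D keypoint constraint can be illustrated with a standard PnP solve: given 2D keypoints in the image and their 3D locations on an object model, a 6DoF pose follows. The box-corner keypoints, dimensions, intrinsics, and pose below are fabricated for illustration, and OpenCV's generic solver stands in for the paper's fitting procedure.

```python
import numpy as np
import cv2

# A car-sized 3D box model whose 8 corners stand in for learned keypoints.
l, w, h = 4.0, 1.6, 1.5
object_pts = np.array([[sx * l / 2, sy * w / 2, sz * h / 2]
                       for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                      dtype=np.float64)
K = np.array([[700.0, 0.0, 620.0], [0.0, 700.0, 190.0], [0.0, 0.0, 1.0]])

# Simulate a detection: project the model corners under a known pose...
rvec_gt, tvec_gt = np.array([0.0, 0.3, 0.0]), np.array([1.0, 1.2, 15.0])
image_pts, _ = cv2.projectPoints(object_pts, rvec_gt, tvec_gt, K, None)

# ...then recover the 6DoF pose from the 2D/3D keypoint correspondences.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
print(ok, tvec.ravel())  # ~ [1.0, 1.2, 15.0]
```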
- Ground-aware Monocular 3D Object Detection for Autonomous Driving [6.5702792909006735]
Estimating the 3D position and orientation of objects in the environment with a single RGB camera is a challenging task for low-cost urban autonomous driving and mobile robots.
Most of the existing algorithms are based on the geometric constraints in 2D-3D correspondence, which stems from generic 6D object pose estimation.
We introduce a novel neural network module to fully utilize such application-specific priors in the framework of deep learning.
arXiv Detail & Related papers (2021-02-01T08:18:24Z)
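One classic ground-plane prior of this kind: for a pixel where an object touches a flat road, a level camera at a known height gives depth in closed form. A minimal sketch, with KITTI-like numbers assumed for illustration:

```python
def depth_from_ground_contact(v, fy, cy, cam_height):
    """Depth z (m) of a ground-plane pixel at image row v, for a level camera
    mounted cam_height meters above a flat road; valid for v > cy (below the
    horizon): z = fy * cam_height / (v - cy)."""
    return fy * cam_height / (v - cy)

# The bottom edge of a 2D box at row 300 pins the object's depth to ~9.4 m.
z = depth_from_ground_contact(v=300.0, fy=721.5, cy=172.9, cam_height=1.65)
print(f"{z:.1f} m")
```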
- E3D: Event-Based 3D Shape Reconstruction [19.823758341937605]
3D shape reconstruction is a primary component of augmented/virtual reality.
Previous solutions based on RGB, RGB-D and Lidar sensors are power- and data-intensive.
We approach 3D reconstruction with an event camera, a sensor with significantly lower power, latency and data expense.
arXiv Detail & Related papers (2020-12-09T18:23:21Z)
- Expandable YOLO: 3D Object Detection from RGB-D Images [64.14512458954344]
This paper aims at constructing a light-weight object detector that takes a depth image and a color image from a stereo camera as input.
By extending the middle of the YOLOv3 network architecture to 3D, the detector can also make predictions along the depth direction.
Intersection over Union (IoU) in 3D space is introduced to evaluate the accuracy of region extraction results.
arXiv Detail & Related papers (2020-06-26T07:32:30Z)
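For reference, here is a minimal sketch of 3D IoU for axis-aligned boxes; rotated (yawed) boxes need more machinery, but the axis-aligned case shows the metric's structure. The box encoding (x1, y1, z1, x2, y2, z2) is an assumption for illustration.

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    lo = np.maximum(a[:3], b[:3])          # min corner of the intersection
    hi = np.minimum(a[3:], b[3:])          # max corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

a = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0])
b = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
print(iou_3d(a, b))  # 1 / 15 ~= 0.067
```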
- DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes [54.239416488865565]
We propose a fast single-stage 3D object detection method for LIDAR data.
The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes.
We find that our proposed method improves over the state of the art by 5% on object detection in ScanNet scenes and achieves top results, with a 3.4% margin, on the Waymo Open Dataset.
arXiv Detail & Related papers (2020-04-02T17:48:50Z)
- Atlas: End-to-End 3D Scene Reconstruction from Posed Images [13.154808583020229]
We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images.
A 2D CNN extracts features from each image independently which are then back-projected and accumulated into a voxel volume.
A 3D CNN refines the accumulated features and predicts the TSDF values.
arXiv Detail & Related papers (2020-03-23T17:59:15Z)
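A minimal sketch of the back-projection step, assuming a pinhole camera and nearest-pixel sampling; Atlas's actual implementation details (grid extents, sampling, view weighting) may differ.

```python
import torch

def backproject(feat, K, voxel_centers):
    """feat: (C, H, W) image features; K: (3, 3) pinhole intrinsics;
    voxel_centers: (V, 3) points in the camera frame. Returns (V, C): each
    voxel takes the feature at the pixel its center projects to (zeros if it
    falls outside the image or behind the camera)."""
    C, H, W = feat.shape
    uvw = voxel_centers @ K.T                    # (V, 3) homogeneous pixels
    z = uvw[:, 2].clamp(min=1e-6)
    u = (uvw[:, 0] / z).round().long()
    v = (uvw[:, 1] / z).round().long()
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = torch.zeros(voxel_centers.shape[0], C)
    out[valid] = feat[:, v[valid], u[valid]].T   # nearest-pixel sampling
    return out

feat = torch.randn(32, 120, 160)                 # one view's 2D CNN features
K = torch.tensor([[200.0, 0.0, 80.0], [0.0, 200.0, 60.0], [0.0, 0.0, 1.0]])
centers = torch.rand(1000, 3) * torch.tensor([4.0, 3.0, 5.0]) \
          + torch.tensor([-2.0, -1.5, 0.5])      # grid in front of the camera
vox_feat = backproject(feat, K, centers)         # (1000, 32); average over views
```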
- DSGN: Deep Stereo Geometry Network for 3D Object Detection [79.16397166985706]
There is a large performance gap between image-based and LiDAR-based 3D object detectors.
Our method, called Deep Stereo Geometry Network (DSGN), significantly reduces this gap.
For the first time, we provide a simple and effective one-stage stereo-based 3D detection pipeline.
arXiv Detail & Related papers (2020-01-10T11:44:37Z)
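The stereo geometry such methods exploit can be summarized in one line: for rectified views with focal length f (px) and baseline b (m), a matched disparity d (px) implies depth z = f * b / d. A tiny sketch with KITTI-like numbers assumed for illustration:

```python
# Rectified stereo: focal length f (px), baseline b (m), disparity d (px).
f, b, d = 721.5, 0.54, 38.9
z = f * b / d                # depth from disparity: z = f * b / d
print(f"{z:.2f} m")          # ~10.0 m
```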
This list is automatically generated from the titles and abstracts of the papers on this site.