VR3Dense: Voxel Representation Learning for 3D Object Detection and
Monocular Dense Depth Reconstruction
- URL: http://arxiv.org/abs/2104.05932v1
- Date: Tue, 13 Apr 2021 04:25:54 GMT
- Title: VR3Dense: Voxel Representation Learning for 3D Object Detection and
Monocular Dense Depth Reconstruction
- Authors: Shubham Shrivastava
- Abstract summary: We introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks.
It takes as inputs, a LiDAR point-cloud, and a single RGB image during inference and produces object pose predictions as well as a densely reconstructed depth map.
While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions.
- Score: 0.951828574518325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D object detection and dense depth estimation are one of the most vital
tasks in autonomous driving. Multiple sensor modalities can jointly attribute
towards better robot perception, and to that end, we introduce a method for
jointly training 3D object detection and monocular dense depth reconstruction
neural networks. It takes as inputs, a LiDAR point-cloud, and a single RGB
image during inference and produces object pose predictions as well as a
densely reconstructed depth map. LiDAR point-cloud is converted into a set of
voxels, and its features are extracted using 3D convolution layers, from which
we regress object pose parameters. Corresponding RGB image features are
extracted using another 2D convolutional neural network. We further use these
combined features to predict a dense depth map. While our object detection is
trained in a supervised manner, the depth prediction network is trained with
both self-supervised and supervised loss functions. We also introduce a loss
function, edge-preserving smooth loss, and show that this results in better
depth estimation compared to the edge-aware smooth loss function, frequently
used in depth prediction works.
Related papers
- NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth
Supervision for Indoor Multi-View 3D Detection [72.0098999512727]
NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by utilizing NeRF to enhance representation learning.
We present three corresponding solutions, including semantic enhancement, perspective-aware sampling, and ordinal depth supervision.
The resulting algorithm, NeRF-Det++, has exhibited appealing performance in the ScanNetV2 and AR KITScenes datasets.
arXiv Detail & Related papers (2024-02-22T11:48:06Z) - Perspective-aware Convolution for Monocular 3D Object Detection [2.33877878310217]
We propose a novel perspective-aware convolutional layer that captures long-range dependencies in images.
By enforcing convolutional kernels to extract features along the depth axis of every image pixel, we incorporates perspective information into network architecture.
We demonstrate improved performance on the KITTI3D dataset, achieving a 23.9% average precision in the easy benchmark.
arXiv Detail & Related papers (2023-08-24T17:25:36Z) - Boosting Monocular 3D Object Detection with Object-Centric Auxiliary
Depth Supervision [13.593246617391266]
We propose a method to boost the RGB image-based 3D detector by jointly training the detection network with a depth prediction loss analogous to the depth estimation task.
Our novel object-centric depth prediction loss focuses on depth around foreground objects, which is important for 3D object detection.
Our depth regression model is further trained to predict the uncertainty of depth to represent the 3D confidence of objects.
arXiv Detail & Related papers (2022-10-29T11:32:28Z) - MonoJSG: Joint Semantic and Geometric Cost Volume for Monocular 3D
Object Detection [10.377424252002792]
monocular 3D object detection lacks accurate depth recovery ability.
Deep neural network (DNN) enables monocular depth-sensing from high-level learned features.
We propose a joint semantic and geometric cost volume to model the depth error.
arXiv Detail & Related papers (2022-03-16T11:54:10Z) - Sparse Depth Completion with Semantic Mesh Deformation Optimization [4.03103540543081]
We propose a neural network with post-optimization, which takes an RGB image and sparse depth samples as input and predicts the complete depth map.
Our evaluation results outperform the existing work consistently on both indoor and outdoor datasets.
arXiv Detail & Related papers (2021-12-10T13:01:06Z) - Virtual Normal: Enforcing Geometric Constraints for Accurate and Robust
Depth Prediction [87.08227378010874]
We show the importance of the high-order 3D geometric constraints for depth prediction.
By designing a loss term that enforces a simple geometric constraint, we significantly improve the accuracy and robustness of monocular depth estimation.
We show state-of-the-art results of learning metric depth on NYU Depth-V2 and KITTI.
arXiv Detail & Related papers (2021-03-07T00:08:21Z) - Ground-aware Monocular 3D Object Detection for Autonomous Driving [6.5702792909006735]
Estimating the 3D position and orientation of objects in the environment with a single RGB camera is a challenging task for low-cost urban autonomous driving and mobile robots.
Most of the existing algorithms are based on the geometric constraints in 2D-3D correspondence, which stems from generic 6D object pose estimation.
We introduce a novel neural network module to fully utilize such application-specific priors in the framework of deep learning.
arXiv Detail & Related papers (2021-02-01T08:18:24Z) - Learning to Recover 3D Scene Shape from a Single Image [98.20106822614392]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then use 3D point cloud encoders to predict the missing depth shift and focal length that allow us to recover a realistic 3D scene shape.
arXiv Detail & Related papers (2020-12-17T02:35:13Z) - Expandable YOLO: 3D Object Detection from RGB-D Images [64.14512458954344]
This paper aims at constructing a light-weight object detector that inputs a depth and a color image from a stereo camera.
By extending the network architecture of YOLOv3 to 3D in the middle, it is possible to output in the depth direction.
Intersection over Uninon (IoU) in 3D space is introduced to confirm the accuracy of region extraction results.
arXiv Detail & Related papers (2020-06-26T07:32:30Z) - D3Feat: Joint Learning of Dense Detection and Description of 3D Local
Features [51.04841465193678]
We leverage a 3D fully convolutional network for 3D point clouds.
We propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point.
Our method achieves state-of-the-art results in both indoor and outdoor scenarios.
arXiv Detail & Related papers (2020-03-06T12:51:09Z) - Implicit Functions in Feature Space for 3D Shape Reconstruction and
Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction in ShapeNet, and obtain significantly more accurate 3D human reconstructions.
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.