Boosting Monocular 3D Object Detection with Object-Centric Auxiliary
Depth Supervision
- URL: http://arxiv.org/abs/2210.16574v1
- Date: Sat, 29 Oct 2022 11:32:28 GMT
- Title: Boosting Monocular 3D Object Detection with Object-Centric Auxiliary
Depth Supervision
- Authors: Youngseok Kim, Sanmin Kim, Sangmin Sim, Jun Won Choi, Dongsuk Kum
- Abstract summary: We propose a method to boost the RGB image-based 3D detector by jointly training the detection network with a depth prediction loss analogous to the depth estimation task.
Our novel object-centric depth prediction loss focuses on depth around foreground objects, which is important for 3D object detection.
Our depth regression model is further trained to predict the uncertainty of depth to represent the 3D confidence of objects.
- Score: 13.593246617391266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in monocular 3D detection leverage a depth estimation network
explicitly as an intermediate stage of the 3D detection network. Depth map
approaches yield more accurate depth to objects than other methods thanks to
the depth estimation network trained on a large-scale dataset. However, depth
map approaches can be limited by the accuracy of the depth map, and
sequentially using two separate networks for depth estimation and 3D detection
significantly increases computation cost and inference time. In this work, we
propose a method to boost the RGB image-based 3D detector by jointly training
the detection network with a depth prediction loss analogous to the depth
estimation task. In this way, our 3D detection network can be supervised with
richer depth supervision from raw LiDAR points, which incurs no human
annotation cost, to estimate accurate depth without explicitly predicting a
depth map. Our novel object-centric depth prediction loss focuses on depth
around foreground objects, which is important for 3D object detection, to
leverage pixel-wise depth supervision in an object-centric manner. Our depth
regression model is further trained to predict the uncertainty of depth to
represent the 3D confidence of objects. To effectively train the 3D detector
with raw LiDAR points and to enable end-to-end training, we revisit the
regression targets of 3D objects and design the network architecture accordingly. Extensive
experiments on KITTI and nuScenes benchmarks show that our method can
significantly boost the monocular image-based 3D detector to outperform depth
map approaches while maintaining real-time inference speed.
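The abstract does not spell out these losses. As a minimal sketch, the object-centric depth loss could be a regression penalty on projected raw LiDAR depth restricted to pixels inside 2D object boxes, with the uncertainty trained through a Laplacian negative log-likelihood, a common choice in monocular 3D detection; every name and tensor layout below is an illustrative assumption, not the paper's actual implementation.

```python
import torch

def object_centric_depth_loss(pred_depth, pred_log_sigma, lidar_depth, fg_mask):
    """Hypothetical sketch of an object-centric auxiliary depth loss.

    pred_depth:     (B, H, W) depth regressed by the detection head
    pred_log_sigma: (B, H, W) predicted log of the depth uncertainty
    lidar_depth:    (B, H, W) sparse depth from projecting raw LiDAR points
                    (0 where no LiDAR return exists)
    fg_mask:        (B, H, W) 1.0 inside annotated 2D object boxes, else 0.0
    """
    valid = (lidar_depth > 0).float()   # supervise only pixels with a LiDAR return
    weight = valid * fg_mask            # ... and only around foreground objects
    # Laplacian negative log-likelihood: |d - d_hat| / sigma + log(sigma);
    # the network is free to inflate sigma where its depth is unreliable.
    nll = torch.abs(pred_depth - lidar_depth) * torch.exp(-pred_log_sigma) + pred_log_sigma
    return (nll * weight).sum() / weight.sum().clamp(min=1.0)
```

Under this formulation the predicted sigma plays the role of the 3D confidence mentioned above: where the network expects its depth to be wrong it inflates sigma, and that value can down-weight detection scores at inference.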
Related papers
- OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection [102.0744303467713]
We propose a new multi-view 3D object detector named OPEN.
Our main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding.
OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
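The summary does not specify the embedding itself. A minimal sketch of one plausible reading, assuming a sinusoidal encoding of each object's predicted depth refined by a small MLP (both the module and its names are hypothetical, not OPEN's published design):

```python
import torch
import torch.nn as nn

class ObjectDepthEmbedding(nn.Module):
    """Hypothetical: encode each object's predicted depth sinusoidally,
    project it with an MLP, and add it to that object's feature vector."""

    def __init__(self, dim: int = 256, temperature: float = 10000.0):
        super().__init__()
        self.dim = dim
        self.temperature = temperature
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, obj_feats: torch.Tensor, obj_depth: torch.Tensor) -> torch.Tensor:
        # obj_feats: (N, dim) per-object features; obj_depth: (N,) depths in meters
        i = torch.arange(self.dim // 2, device=obj_depth.device)
        freqs = self.temperature ** (2 * i / self.dim)         # (dim/2,) frequencies
        angles = obj_depth[:, None] / freqs[None, :]           # (N, dim/2)
        pos_emb = torch.cat([angles.sin(), angles.cos()], -1)  # (N, dim)
        return obj_feats + self.mlp(pos_emb)
```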
arXiv Detail & Related papers (2024-07-15T14:29:15Z)
- Toward Accurate Camera-based 3D Object Detection via Cascade Depth Estimation and Calibration [20.82054596017465]
Recent camera-based 3D object detection is limited by the precision of the transformation from image space to 3D feature space.
This paper addresses a fundamental problem of camera-based 3D object detection: how to effectively learn depth information for accurate feature lifting and object localization.
arXiv Detail & Related papers (2024-02-07T14:21:26Z)
- Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking [47.59619420444781]
Approaches to monocular 3D perception, including detection and tracking, often yield inferior performance compared to LiDAR-based techniques.
We propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation.
arXiv Detail & Related papers (2022-06-08T03:37:59Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection, and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
- MDS-Net: A Multi-scale Depth Stratification Based Monocular 3D Object Detection Algorithm [4.958840734249869]
This paper proposes a one-stage monocular 3D object detection algorithm based on multi-scale depth stratification.
Experiments on the KITTI benchmark show that MDS-Net outperforms existing monocular 3D detection methods on both the 3D detection and BEV detection tasks.
arXiv Detail & Related papers (2022-01-12T07:11:18Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method remarkably improves the detection performance of the state-of-the-art monocular method by 2.80% on the moderate test setting, without requiring extra data.
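The summary does not reproduce the geometry formula. For intuition, standard pinhole geometry already ties an object's depth to its physical and projected heights, which is the kind of relation a projective model of 2D and 3D predictions builds on; the sketch below shows only this textbook relation, not the paper's full formulation.

```python
def depth_from_projection(focal_px: float, height_3d_m: float, height_2d_px: float) -> float:
    """Pinhole projection: an object of physical height H at depth z projects
    to h = f * H / z pixels, so z = f * H / h. (Illustrative relation only;
    the paper's full projective model may include further terms.)"""
    return focal_px * height_3d_m / height_2d_px

# e.g. a 1.5 m tall car imaged 50 px tall with a 720 px focal length:
print(depth_from_projection(720.0, 1.5, 50.0))  # 720 * 1.5 / 50 = 21.6 m
```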
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
- Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth [64.29043589521308]
We propose a rendering module that augments the training data by synthesizing images with virtual depths.
The rendering module takes as input an RGB image and its corresponding sparse depth image and outputs a variety of photo-realistic synthetic images.
In addition, we introduce an auxiliary module that improves the detection model by jointly optimizing it through a depth estimation task.
arXiv Detail & Related papers (2021-07-28T11:00:47Z)
- VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction [0.951828574518325]
We introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks.
It takes as input a LiDAR point cloud and a single RGB image during inference and produces object pose predictions as well as a densely reconstructed depth map.
While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions.
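The individual loss terms are not given in the summary. A sketch under common assumptions, pairing a supervised masked L1 term on sparse LiDAR depth with an edge-aware smoothness term as a stand-in for the self-supervised component (VR3Dense's actual losses may differ):

```python
import torch
import torch.nn.functional as F

def depth_losses(pred, sparse_gt, image, w_smooth=0.1):
    """Hypothetical combined depth loss.
    pred, sparse_gt: (B, 1, H, W); image: (B, 3, H, W).
    Assumes at least one LiDAR return exists in the batch."""
    valid = sparse_gt > 0
    sup = F.l1_loss(pred[valid], sparse_gt[valid])  # supervised, masked to LiDAR hits
    # Edge-aware smoothness: penalize depth gradients except at image edges,
    # a common self-supervised regularizer for dense depth.
    dx_d = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dy_d = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    smooth = (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
    return sup + w_smooth * smooth
```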
arXiv Detail & Related papers (2021-04-13T04:25:54Z)
- Expandable YOLO: 3D Object Detection from RGB-D Images [64.14512458954344]
This paper aims at constructing a lightweight object detector that takes a depth image and a color image from a stereo camera as input.
By extending the middle of the YOLOv3 network architecture to 3D, the network can also produce outputs along the depth direction.
Intersection over Union (IoU) in 3D space is introduced to evaluate the accuracy of region extraction results.
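The summary does not define the computation. For axis-aligned boxes, 3D IoU reduces to the overlap volume divided by the union volume, as in this illustrative sketch (rotated boxes would additionally need polygon clipping):

```python
def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):                     # overlap extent along each axis
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        if hi <= lo:
            return 0.0                     # no overlap on this axis
        inter *= hi - lo
    vol = lambda b: (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol(box_a) + vol(box_b) - inter)

# Two unit cubes overlapping by half along x: IoU = 0.5 / 1.5 = 1/3
print(iou_3d_axis_aligned((0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1.5, 1, 1)))
```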
arXiv Detail & Related papers (2020-06-26T07:32:30Z)
- DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points [14.254472131009653]
Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation.
Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems.
We propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs.
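For intuition about step (b), the classical two-view counterpart of the triangulation that DELTAS learns end to end is the direct linear transform (DLT); the sketch below shows that textbook procedure, not the paper's network:

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Two-view DLT triangulation of one 3D point.
    P1, P2: (3, 4) camera projection matrices; x1, x2: (2,) pixel coords."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],   # x * p3^T - p1^T for view 1
        x1[1] * P1[2] - P1[1],   # y * p3^T - p2^T for view 1
        x2[0] * P2[2] - P2[0],   # same constraints for view 2
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)  # least-squares null vector of A
    X = vt[-1]
    return X[:3] / X[3]          # dehomogenize to a 3D point
```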
arXiv Detail & Related papers (2020-03-19T17:56:41Z)