Revisiting Monocular 3D Object Detection from Scene-Level Depth Retargeting to Instance-Level Spatial Refinement
- URL: http://arxiv.org/abs/2412.19165v1
- Date: Thu, 26 Dec 2024 10:51:50 GMT
- Title: Revisiting Monocular 3D Object Detection from Scene-Level Depth Retargeting to Instance-Level Spatial Refinement
- Authors: Qiude Zhang, Chunyu Lin, Zhijie Shen, Nie Lang, Yao Zhao
- Abstract summary: Monocular 3D object detection is challenging due to the lack of accurate depth.
Existing depth-assisted solutions still exhibit inferior performance.
We propose a novel depth-adapted monocular 3D object detection network.
- Score: 44.4805861813093
- Abstract: Monocular 3D object detection is challenging due to the lack of accurate depth. However, existing depth-assisted solutions still exhibit inferior performance, which is commonly attributed to the unsatisfactory accuracy of monocular depth estimation models. In this paper, we revisit monocular 3D object detection from the depth perspective and identify an additional issue: the limited 3D structure-aware capability of existing depth representations (e.g., depth one-hot encoding or depth distribution). To address this issue, we propose a novel depth-adapted monocular 3D object detection network, termed RD3D, that mainly comprises a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module. The former incorporates scene-level perception of 3D structures, retargeting traditional depth representations to a new formulation, the Depth Thickness Field. The latter refines the voxel spatial representation with the guidance of instances, eliminating the ambiguity of 3D occupation and thus improving detection accuracy. Extensive experiments on the KITTI and Waymo datasets demonstrate our superiority over existing state-of-the-art (SoTA) methods, as well as our generality when equipped with different depth estimation models. The code will be available.
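The abstract names two conventional depth representations, one-hot encoding and depth distribution, without defining them. Below is a minimal sketch of both on a toy depth map, assuming uniform binning in PyTorch; the bin count and depth range are arbitrary choices, and the Depth Thickness Field itself is the paper's contribution and is not specified here.

```python
import torch
import torch.nn.functional as F

def discretize_depth(depth, d_min=1.0, d_max=60.0, num_bins=80):
    """Map a metric depth map (H, W) to uniform bin indices in [0, num_bins-1]."""
    depth = depth.clamp(d_min, d_max)
    idx = (depth - d_min) / (d_max - d_min) * (num_bins - 1)
    return idx.round().long()

def depth_one_hot(depth, num_bins=80):
    """Hard one-hot encoding: each pixel occupies exactly one depth bin."""
    idx = discretize_depth(depth, num_bins=num_bins)
    return F.one_hot(idx, num_classes=num_bins).permute(2, 0, 1).float()  # (D, H, W)

def depth_distribution(logits):
    """Soft distribution over depth bins, as predicted by a network head."""
    return logits.softmax(dim=0)  # (D, H, W), sums to 1 along the depth axis

depth = torch.rand(48, 160) * 59.0 + 1.0          # toy metric depth map
onehot = depth_one_hot(depth)                      # (80, 48, 160)
dist = depth_distribution(torch.randn(80, 48, 160))
```

Both forms concentrate mass at a single surface along each camera ray, which appears to be the limited structure-awareness the abstract argues against.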
Related papers
- Deep Neural Networks for Accurate Depth Estimation with Latent Space Features [0.0]
This study introduces a novel depth estimation framework that leverages latent space features within a deep convolutional neural network.
The proposed model features a dual encoder-decoder architecture, enabling both color-to-depth and depth-to-depth transformations.
The framework is thoroughly tested using the NYU Depth V2 dataset, where it sets a new benchmark.
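The entry mentions a dual encoder-decoder layout for color-to-depth and depth-to-depth transformations; a minimal sketch of that layout follows, with tiny assumed layer sizes standing in for the real backbone.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """A tiny conv encoder-decoder; real models use much deeper backbones."""
    def __init__(self, in_ch):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )
    def forward(self, x):
        return self.dec(self.enc(x))

rgb2depth = EncoderDecoder(in_ch=3)    # color-to-depth branch
depth2depth = EncoderDecoder(in_ch=1)  # depth-to-depth (refinement) branch
pred = depth2depth(rgb2depth(torch.rand(1, 3, 64, 64)))  # (1, 1, 64, 64)
```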
arXiv Detail & Related papers (2025-02-17T13:11:35Z) - OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection [102.0744303467713]
We propose a new multi-view 3D object detector named OPEN.
Our main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding.
OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
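The abstract does not spell out the object-wise position embedding. A hedged sketch of one plausible reading follows, in which per-object depths are sinusoidally embedded and added to object query features; the embedding dimension, frequency base, and depth normalization are all assumptions.

```python
import math
import torch

def sinusoidal_embed(depth, dim=256, max_depth=100.0):
    """Sinusoidal embedding of per-object depths (N,) -> (N, dim)."""
    freqs = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(1e4) / dim))
    angles = (depth / max_depth).unsqueeze(-1) * freqs   # (N, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (N, dim)

obj_depth = torch.tensor([12.3, 34.7, 8.1])          # predicted per-object depths
queries = torch.randn(3, 256)                        # object query features
queries = queries + sinusoidal_embed(obj_depth)      # inject depth information
```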
arXiv Detail & Related papers (2024-07-15T14:29:15Z) - GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest image resolution needed and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z) - MonoCD: Monocular 3D Object Detection with Complementary Depths [9.186673054867866]
Depth estimation is an essential but challenging subtask of monocular 3D object detection.
We propose to increase the complementarity of depths with two novel designs.
Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data.
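The two designs are not described in this summary. As a stand-in for the general idea of complementary depths, the sketch below fuses a directly regressed depth with a geometry-derived one by inverse-variance weighting; the fusion rule is an assumption, not MonoCD's.

```python
import torch

def fuse_depths(d_direct, d_geom, sigma_direct, sigma_geom):
    """Inverse-variance fusion of a directly regressed depth and a
    geometry-derived depth; if their errors are negatively correlated
    (complementary), the fused estimate improves on both."""
    w1 = 1.0 / sigma_direct ** 2
    w2 = 1.0 / sigma_geom ** 2
    return (w1 * d_direct + w2 * d_geom) / (w1 + w2)

d = fuse_depths(torch.tensor(21.4), torch.tensor(19.8),
                torch.tensor(0.8), torch.tensor(1.2))
```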
arXiv Detail & Related papers (2024-04-04T03:30:49Z) - Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision [13.593246617391266]
We propose a method to boost the RGB image-based 3D detector by jointly training the detection network with a depth prediction loss analogous to the depth estimation task.
Our novel object-centric depth prediction loss focuses on depth around foreground objects, which is important for 3D object detection.
Our depth regression model is further trained to predict the uncertainty of depth to represent the 3D confidence of objects.
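A minimal sketch of the two ingredients this summary names, a depth loss restricted to foreground pixels and a per-pixel uncertainty, using an assumed Laplacian negative log-likelihood form (a standard choice for depth uncertainty, not necessarily the paper's exact loss):

```python
import torch

def object_centric_depth_loss(pred_depth, gt_depth, fg_mask, log_sigma):
    """L1 depth loss on foreground pixels only, with a Laplacian
    aleatoric-uncertainty term: |d - d*| / sigma + log sigma."""
    err = (pred_depth - gt_depth).abs()
    nll = err / log_sigma.exp() + log_sigma       # Laplacian NLL (up to a constant)
    return (nll * fg_mask).sum() / fg_mask.sum().clamp(min=1)

pred = torch.rand(1, 96, 320) * 60
gt = torch.rand(1, 96, 320) * 60
mask = (torch.rand(1, 96, 320) > 0.9).float()     # pixels inside 2D object boxes
loss = object_centric_depth_loss(pred, gt, mask, torch.zeros_like(pred))
```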
arXiv Detail & Related papers (2022-10-29T11:32:28Z) - MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection [57.969536140562674]
We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.
We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions.
On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
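The depth-guided decoder is summarized only at a high level. The sketch below shows one plausible decoder layer in which object queries cross-attend to both visual and depth features; the layer sizes and ordering are assumptions.

```python
import torch
import torch.nn as nn

class DepthGuidedDecoderLayer(nn.Module):
    """Object queries attend to visual features and then to depth features,
    loosely following the depth-guided decoder idea (sizes assumed)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, q, visual_feats, depth_feats):
        q = self.norm1(q + self.visual_attn(q, visual_feats, visual_feats)[0])
        q = self.norm2(q + self.depth_attn(q, depth_feats, depth_feats)[0])
        return self.norm3(q + self.ffn(q))

layer = DepthGuidedDecoderLayer()
out = layer(torch.randn(1, 50, 256),    # 50 learnable object queries
            torch.randn(1, 600, 256),   # flattened image features
            torch.randn(1, 600, 256))   # flattened depth features
```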
arXiv Detail & Related papers (2022-03-24T19:28:54Z) - MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer [25.61949580447076]
We propose MonoDTR, a novel end-to-end depth-aware transformer network for monocular 3D object detection.
It mainly consists of two components: (1) the Depth-Aware Feature Enhancement (DFE) module that implicitly learns depth-aware features without requiring extra computation, and (2) the Depth-Aware Transformer (DTR) module that globally integrates context- and depth-aware features.
Our proposed depth-aware modules can be easily plugged into existing image-only monocular 3D object detectors to improve the performance.
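A loose sketch of a plug-in depth-aware enhancement in the spirit of the DFE module: an auxiliary depth head supervises the features during training, and its prediction is fused back into the feature map. The concrete design here is assumed, not MonoDTR's.

```python
import torch
import torch.nn as nn

class DepthAwareEnhancement(nn.Module):
    """Make backbone features depth-aware via an auxiliary depth head; the
    depth logits receive an auxiliary loss at training time and are fused
    back with a 1x1 conv (details assumed)."""
    def __init__(self, ch=256, num_bins=64):
        super().__init__()
        self.depth_head = nn.Conv2d(ch, num_bins, 1)   # auxiliary depth logits
        self.mix = nn.Conv2d(ch + num_bins, ch, 1)     # fuse back into features

    def forward(self, feats):
        depth_logits = self.depth_head(feats)
        enhanced = self.mix(torch.cat([feats, depth_logits.softmax(1)], dim=1))
        return enhanced, depth_logits

mod = DepthAwareEnhancement()
f, d = mod(torch.randn(1, 256, 24, 80))
```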
arXiv Detail & Related papers (2022-03-21T13:40:10Z) - MonoJSG: Joint Semantic and Geometric Cost Volume for Monocular 3D Object Detection [10.377424252002792]
Monocular 3D object detection lacks the ability to recover accurate depth. Deep neural networks (DNNs) enable monocular depth sensing from high-level learned features.
We propose a joint semantic and geometric cost volume to model the depth error.
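A much-simplified stand-in for a joint semantic and geometric cost volume over per-object depth hypotheses: the geometric cost scores each candidate depth by the reprojection error of the object's 3D height, and a semantic cost term is added per hypothesis. The cost forms and weighting are assumptions.

```python
import torch

def joint_cost_volume(depths, f_px, H3d, h2d_obs, feat_sim):
    """Score candidate depths for one object: a geometric cost from the
    reprojection error of the object's 3D height, plus a semantic cost per
    hypothesis (e.g., feature dissimilarity at the reprojected location)."""
    h2d_pred = f_px * H3d / depths               # projected 2D height per depth
    geometric = (h2d_pred - h2d_obs).abs() / h2d_obs
    return geometric + feat_sim                  # elementwise total cost

depths = torch.linspace(5.0, 60.0, 32)           # candidate depths (meters)
feat_sim = torch.rand(32)                        # semantic cost per hypothesis
cost = joint_cost_volume(depths, f_px=720.0, H3d=1.5, h2d_obs=36.0,
                         feat_sim=feat_sim)
best_depth = depths[cost.argmin()]               # lowest cost = best hypothesis
```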
arXiv Detail & Related papers (2022-03-16T11:54:10Z) - Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method remarkably improves the detection performance of the state-of-the-art monocular method by 2.80% on the moderate test setting, without using extra data.
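The projective relation at the heart of geometry-guided depth is the pinhole identity d = f * H_3D / h_2D: an object of known 3D height projecting to a measured 2D pixel height determines its depth. A worked example:

```python
def depth_from_height(focal_px, height_3d_m, height_2d_px):
    """Pinhole projection: depth = f * H_3D / h_2D."""
    return focal_px * height_3d_m / height_2d_px

# A 1.5 m tall car spanning 45 px under a 720 px focal length:
d = depth_from_height(720.0, 1.5, 45.0)  # 24.0 m
```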
arXiv Detail & Related papers (2021-07-29T12:30:39Z) - Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth [64.29043589521308]
We propose a rendering module to augment the training data by synthesizing images with virtual-depths.
The rendering module takes as input the RGB image and its corresponding sparse depth image, and outputs a variety of photo-realistic synthetic images.
Besides, we introduce an auxiliary module to improve the detection model by jointly optimizing it through a depth estimation task.
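The paper's rendering module synthesizes photo-realistic images from RGB plus sparse depth. The far simpler sketch below captures only the virtual-depth geometry: rescaling an image while keeping the focal length fixed emulates scaling all depths by a factor. This is a common approximation, not the paper's renderer.

```python
import torch
import torch.nn.functional as F

def virtual_depth_resize(image, depth, scale):
    """Shrinking an image by 1/scale at fixed focal length is geometrically
    equivalent to pushing the scene out to depth * scale, yielding a
    virtual-depth training sample."""
    h, w = image.shape[-2:]
    new_size = (int(h / scale), int(w / scale))
    img_s = F.interpolate(image, size=new_size, mode='bilinear',
                          align_corners=False)
    depth_s = F.interpolate(depth, size=new_size, mode='nearest') * scale
    return img_s, depth_s

img = torch.rand(1, 3, 192, 640)
dep = torch.rand(1, 1, 192, 640) * 60
img_v, dep_v = virtual_depth_resize(img, dep, scale=1.5)  # objects appear farther
```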
arXiv Detail & Related papers (2021-07-28T11:00:47Z)