DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection
- URL: http://arxiv.org/abs/2207.08531v1
- Date: Mon, 18 Jul 2022 11:49:18 GMT
- Title: DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection
- Authors: Liang Peng, Xiaopei Wu, Zheng Yang, Haifeng Liu, and Deng Cai
- Abstract summary: Monocular 3D detection has drawn much attention from the community due to its low cost and setup simplicity.
The most challenging sub-task is instance depth estimation.
We propose to reformulate instance depth as the combination of the instance visual surface depth and the instance attribute depth.
- Score: 34.01288862240829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular 3D detection has drawn much attention from the community
due to its low cost and setup simplicity. It takes an RGB image as input and
predicts 3D boxes in 3D space. The most challenging sub-task is instance depth
estimation. Previous works usually use a direct estimation method; however, in
this paper we point out that instance depth on an RGB image is non-intuitive.
It is a coupling of visual depth cues and instance attribute cues, which makes
it hard to learn directly in the network. Therefore, we propose to reformulate
instance depth as the combination of the instance visual surface depth (visual
depth) and the instance attribute depth (attribute depth). The visual depth is
related to objects' appearances and positions in the image. By contrast, the
attribute depth relies on objects' inherent attributes, which are invariant to
affine transformations of the object in the image. Correspondingly, we
decouple the 3D location uncertainty into visual depth uncertainty and
attribute depth uncertainty. By combining the different types of depths with
their associated uncertainties, we obtain the final instance depth.
Furthermore, data augmentation in monocular 3D detection is usually limited by
physical constraints, hindering performance gains; the proposed instance depth
disentanglement strategy alleviates this problem. Evaluated on KITTI, our
method achieves new state-of-the-art results, and extensive ablation studies
validate the effectiveness of each component. The code is released at
https://github.com/SPengLiang/DID-M3D.
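To make the combination step concrete, below is a minimal sketch of how visual and attribute depths and their uncertainties could be fused into a final instance depth. The additive uncertainty rule, the exponential confidence weighting, and the function name are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def combine_instance_depth(visual_depth, visual_unc, attr_depth, attr_unc):
    """Fuse per-location visual and attribute depth predictions for one
    object into a single instance depth (illustrative sketch).

    All inputs are arrays of the same shape, e.g. values over an RoI grid.
    """
    depth = visual_depth + attr_depth      # recombine the decoupled depths
    unc = visual_unc + attr_unc            # combined uncertainty (assumed additive)
    conf = np.exp(-unc)                    # low uncertainty -> high confidence
    weights = conf / conf.sum()            # normalize into a distribution
    return float((weights * depth).sum())  # uncertainty-weighted average
```

Locations with low combined uncertainty dominate the weighted average, which is the intuition behind decoupling the two uncertainty sources in the first place.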
Related papers
- MonoCD: Monocular 3D Object Detection with Complementary Depths [9.186673054867866]
Depth estimation is an essential but challenging subtask of monocular 3D object detection.
We propose to increase the complementarity of depths with two novel designs.
Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data.
arXiv Detail & Related papers (2024-04-04T03:30:49Z)
- Source-free Depth for Object Pop-out [113.24407776545652]
Modern learning-based methods produce promising depth maps by inference in the wild.
We adapt such depth inference models for object segmentation using the objects' "pop-out" prior in 3D.
Our experiments on eight datasets consistently demonstrate the benefit of our method in terms of both performance and generalizability.
arXiv Detail & Related papers (2022-12-10T21:57:11Z)
- Depth Is All You Need for Monocular 3D Detection [29.403235118234747]
We propose to align the depth representation with the target domain in an unsupervised fashion.
Our methods leverage commonly available LiDAR or RGB videos during training time to fine-tune the depth representation, which leads to improved 3D detectors.
arXiv Detail & Related papers (2022-10-05T18:12:30Z)
- Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features into 3D space and detect 3D objects there.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
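The lifting step mentioned above rests on standard pinhole geometry. As a rough illustration (not DfM's actual feature-lifting code), back-projecting a pixel with an estimated depth into the camera frame looks like this:

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with estimated depth to a 3D point in the
    camera frame via the pinhole model: P = depth * inv(K) @ [u, v, 1]."""
    fx, fy = K[0, 0], K[1, 1]   # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]   # principal point
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])
```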
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
- MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection [61.89277940084792]
We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.
We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions.
On the KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
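As a rough picture of what a depth-guided decoder could look like, the sketch below lets learnable object queries cross-attend to depth-aware image tokens. The layer structure, names, and dimensions are assumptions for illustration, not MonoDETR's actual architecture.

```python
import torch
import torch.nn as nn

class DepthGuidedDecoderLayer(nn.Module):
    """Object queries attend to depth-aware image features (sketch)."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, depth_feats):
        # queries:     (B, N, d_model) learnable 3D object candidates
        # depth_feats: (B, H*W, d_model) depth-guided image tokens
        attended, _ = self.cross_attn(queries, depth_feats, depth_feats)
        return self.norm(queries + attended)   # residual + layer norm
```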
arXiv Detail & Related papers (2022-03-24T19:28:54Z)
- Probabilistic and Geometric Depth: Detecting Objects in Perspective [78.00922683083776]
3D object detection is an important capability needed in various practical applications such as driver assistance systems.
Monocular 3D detection, as an economical solution compared to conventional settings relying on binocular vision or LiDAR, has drawn increasing attention recently but still yields unsatisfactory results.
This paper first presents a systematic study of this problem and observes that the current monocular 3D detection problem can be simplified to an instance depth estimation problem.
arXiv Detail & Related papers (2021-07-29T16:30:33Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, we devise a principled geometric formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network.
Our method improves the detection performance of the state-of-the-art monocular method by 2.80% on the moderate test setting without using extra data.
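One well-known projective relation of this kind is the pinhole proportion z = f * H / h: an object of physical height H metres spanning h pixels under a focal length of f pixels lies at depth z. Whether this is the exact formula the paper devises is not stated in the summary; the worked example below only illustrates the arithmetic.

```python
def geometric_depth(focal_px: float, height_3d_m: float, height_2d_px: float) -> float:
    """Depth from the pinhole proportion z = f * H / h."""
    return focal_px * height_3d_m / height_2d_px

# A ~1.5 m tall car spanning 100 px under a 700 px focal length:
print(geometric_depth(700.0, 1.5, 100.0))  # -> 10.5 (metres)
```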
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
- VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction [0.951828574518325]
We introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks.
It takes a LiDAR point cloud and a single RGB image as inputs during inference, and produces object pose predictions as well as a densely reconstructed depth map.
While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions.
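A minimal sketch of such a mixed objective, assuming a supervised L1 term on LiDAR-covered pixels plus a precomputed self-supervised photometric term (names and weighting are hypothetical):

```python
import torch

def joint_depth_loss(pred_depth, sparse_gt, valid_mask, photo_err, w_self=0.5):
    """Supervised L1 on pixels with LiDAR ground truth, plus a
    self-supervised photometric reconstruction term (assumed precomputed
    by warping adjacent views)."""
    supervised = torch.abs(pred_depth - sparse_gt)[valid_mask].mean()
    return supervised + w_self * photo_err.mean()
```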
arXiv Detail & Related papers (2021-04-13T04:25:54Z)
- Virtual Normal: Enforcing Geometric Constraints for Accurate and Robust Depth Prediction [87.08227378010874]
We show the importance of the high-order 3D geometric constraints for depth prediction.
By designing a loss term that enforces a simple geometric constraint, we significantly improve the accuracy and robustness of monocular depth estimation.
We show state-of-the-art results of learning metric depth on NYU Depth-V2 and KITTI.
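The constraint can be pictured as comparing normals of "virtual planes" spanned by sampled point triplets in the predicted and ground-truth 3D point clouds. The sketch below follows that idea, with the sampling strategy and loss form simplified:

```python
import numpy as np

def plane_normal(p0, p1, p2):
    """Unit normal of the virtual plane through three 3D points."""
    n = np.cross(p1 - p0, p2 - p0)
    return n / (np.linalg.norm(n) + 1e-8)

def virtual_normal_loss(pred_pts, gt_pts, triplets):
    """Mean L1 gap between virtual normals computed from predicted and
    ground-truth point clouds over sampled index triplets (i, j, k)."""
    diffs = [np.abs(plane_normal(pred_pts[i], pred_pts[j], pred_pts[k])
                    - plane_normal(gt_pts[i], gt_pts[j], gt_pts[k])).sum()
             for i, j, k in triplets]
    return float(np.mean(diffs))
```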
arXiv Detail & Related papers (2021-03-07T00:08:21Z)
- Predicting Relative Depth between Objects from Semantic Features [2.127049691404299]
The 3D depth of objects depicted in 2D images is one such semantic feature.
The state of the art in this area consists of complex neural network models trained on stereo image data to predict per-pixel depth.
An overall increase of 14% in relative depth accuracy is achieved over relative depths computed from the monodepth model's results.
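For intuition, reducing per-pixel depth to an object-level relative ordering could look like the following; the median reduction and the tolerance are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def relative_order(depth_map, mask_a, mask_b, tol=0.5):
    """Label object A relative to object B by comparing median depths
    inside their segmentation masks (boolean arrays)."""
    da = np.median(depth_map[mask_a])
    db = np.median(depth_map[mask_b])
    if abs(da - db) <= tol:      # assumed tolerance, in depth units
        return "same depth"
    return "closer" if da < db else "farther"
```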
arXiv Detail & Related papers (2021-01-12T17:28:23Z)