3D Object Aided Self-Supervised Monocular Depth Estimation
- URL: http://arxiv.org/abs/2212.01768v1
- Date: Sun, 4 Dec 2022 08:52:33 GMT
- Title: 3D Object Aided Self-Supervised Monocular Depth Estimation
- Authors: Songlin Wei, Guodong Chen, Wenzheng Chi, Zhenhua Wang and Lining Sun
- Abstract summary: We propose a new method to address dynamic object movements through monocular 3D object detection.
Specifically, we first detect 3D objects in the images and build the per-pixel correspondence of the dynamic pixels with the detected object pose.
In this way, the depth of every pixel can be learned via a meaningful geometry model.
- Score: 5.579605877061333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular depth estimation has been actively studied in fields such as robot
vision, autonomous driving, and 3D scene understanding. Given a sequence of
color images, unsupervised learning methods based on the framework of
Structure-From-Motion (SfM) simultaneously predict depth and camera relative
pose. However, dynamically moving objects in the scene violate the static world
assumption, resulting in inaccurate depths of dynamic objects. In this work, we
propose a new method to address such dynamic object movements through monocular
3D object detection. Specifically, we first detect 3D objects in the images and
build the per-pixel correspondence of the dynamic pixels with the detected
object pose while leaving the static pixels corresponding to the rigid
background to be modeled with camera motion. In this way, the depth of every
pixel can be learned via a meaningful geometry model. In addition, objects are
detected as cuboids with absolute scale, which eliminates the scale ambiguity
inherent in monocular vision. Experiments on the KITTI depth dataset show that
our method achieves state-of-the-art performance for depth
estimation. Furthermore, joint training of depth, camera motion and object pose
also improves monocular 3D object detection performance. To the best of our
knowledge, this is the first work that allows a monocular 3D object detection
network to be fine-tuned in a self-supervised manner.
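The abstract states the decomposition but not its concrete form. As a rough NumPy illustration (not the authors' code; all names here are hypothetical), the sketch below warps static pixels with the camera ego-motion and pixels inside each detected object with that object's own rigid motion, then scores the result with an L1 photometric loss:

```python
import numpy as np

def backproject(depth, K_inv):
    """Lift each pixel (u, v) with predicted depth to a 3D point (camera frame)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1)  # 3 x HW homogeneous pixels
    return (K_inv @ pix) * depth.reshape(1, -1)             # 3 x HW camera-frame points

def photometric_loss(I_t, I_s, depth_t, K, T_cam, obj_poses, obj_masks):
    """L1 error after warping the target image's pixels into the source view.

    T_cam     : 4x4 camera ego-motion (target -> source), used for static pixels.
    obj_poses : list of 4x4 rigid motions, one per detected dynamic object.
    obj_masks : list of boolean HxW masks marking each object's pixels.
    """
    h, w = depth_t.shape
    pts = np.vstack([backproject(depth_t, np.linalg.inv(K)),
                     np.ones((1, h * w))])          # 4 x HW homogeneous points
    warped = T_cam @ pts                            # static-world motion model
    for T_obj, mask in zip(obj_poses, obj_masks):
        idx = mask.reshape(-1)
        warped[:, idx] = T_obj @ pts[:, idx]        # dynamic pixels follow object pose
    proj = K @ warped[:3]
    us = np.clip((proj[0] / proj[2]).round().astype(int), 0, w - 1)
    vs = np.clip((proj[1] / proj[2]).round().astype(int), 0, h - 1)
    I_hat = I_s[vs, us]                             # nearest-neighbour resampling
    return np.abs(I_hat - I_t.reshape(-1, I_t.shape[-1])).mean()
```

A real pipeline would replace the nearest-neighbour lookup with differentiable bilinear sampling so gradients reach the depth, pose, and detection networks.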
Related papers
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- MoGDE: Boosting Mobile Monocular 3D Object Detection with Ground Depth Estimation [20.697822444708237]
We propose a novel Mono3D framework, called MoGDE, which constantly estimates the corresponding ground depth of an image.
MoGDE yields the best performance compared with the state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark.
arXiv Detail & Related papers (2023-03-23T04:06:01Z)
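MoGDE's summary does not spell out the geometry, but the ground depth of a camera at known height over a locally flat road reduces to similar triangles. A minimal sketch, assuming a zero-pitch pinhole camera (values below are illustrative only):

```python
def ground_depth(v, f_y, c_y, cam_height):
    """Depth of a flat-ground pixel at image row v (level pinhole camera).

    For zero pitch the horizon projects to row c_y, and similar triangles
    give Z = f_y * cam_height / (v - c_y) for rows below the horizon.
    """
    assert v > c_y, "row must lie below the horizon"
    return f_y * cam_height / (v - c_y)

# e.g. KITTI-like values: f_y ~ 721 px, c_y ~ 173 px, camera ~1.65 m above road
# ground_depth(350, 721.0, 173.0, 1.65) ~ 6.7 m
```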
- Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), uses the established geometry to lift 2D image features into 3D space and detect 3D objects there.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
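As a hedged illustration of the lifting step (not DfM's actual implementation), each pixel's feature vector can be placed at the 3D point obtained by scaling its viewing ray by the estimated depth, yielding a pseudo point cloud for a 3D detection head:

```python
import numpy as np

def lift_features(feat, depth, K):
    """Lift a 2D feature map to 3D points using per-pixel depth.

    feat  : H x W x C feature map
    depth : H x W estimated metric depth
    K     : 3 x 3 camera intrinsics
    Returns (N, 3) point coordinates and (N, C) features, N = H * W.
    """
    h, w, c = feat.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    pts = (rays * depth.reshape(1, -1)).T   # N x 3 points in the camera frame
    return pts, feat.reshape(-1, c)         # features keep pixel order
```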
- Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation [76.58256020932312]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task.
We present a self-supervised learning framework for 3D object motion field estimation from monocular videos.
arXiv Detail & Related papers (2021-10-13T16:45:01Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method improves the detection performance of the state-of-the-art monocular method by 2.80% on the moderate test setting, without using extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
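The "principled geometry formula" itself is not reproduced in this summary; the standard projective relation such methods build on is that an object of physical height H whose image spans h pixels lies at depth roughly f_y * H / h. A minimal sketch under that assumption:

```python
def depth_from_height(f_y, obj_height_m, box_height_px):
    """Pinhole projection: an object of physical height H imaged with pixel
    height h sits at depth z = f_y * H / h (fronto-parallel assumption)."""
    return f_y * obj_height_m / box_height_px

# e.g. a 1.5 m tall car spanning 60 px under f_y = 721 px:
# depth_from_height(721.0, 1.5, 60.0) ~ 18 m
```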
- MonoGRNet: A General Framework for Monocular 3D Object Detection [23.59839921644492]
We propose MonoGRNet for amodal 3D object detection from a monocular image via geometric reasoning.
MonoGRNet decomposes the monocular 3D object detection task into four sub-tasks including 2D object detection, instance-level depth estimation, projected 3D center estimation and local corner regression.
Experiments are conducted on KITTI, Cityscapes and MS COCO datasets.
arXiv Detail & Related papers (2021-04-18T10:07:52Z)
- Monocular Differentiable Rendering for Self-Supervised 3D Object Detection [21.825158925459732]
3D object detection from monocular images is an ill-posed problem due to the projective entanglement of depth and scale.
We present a novel self-supervised method for textured 3D shape reconstruction and pose estimation of rigid objects.
Our method predicts the 3D location and meshes of each object in an image using differentiable rendering and a self-supervised objective.
arXiv Detail & Related papers (2020-09-30T09:21:43Z)
- Kinematic 3D Object Detection in Monocular Video [123.7119180923524]
We propose a novel method for monocular video-based 3D object detection which carefully leverages kinematic motion to improve precision of 3D localization.
We achieve state-of-the-art performance on monocular 3D object detection and the Bird's Eye View tasks within the KITTI self-driving dataset.
arXiv Detail & Related papers (2020-07-19T01:15:12Z)
- Single View Metrology in the Wild [94.7005246862618]
We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground.
Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights.
We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion.
arXiv Detail & Related papers (2020-07-18T22:31:33Z)
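Setting the learned priors aside, the classical single-view relation behind absolute scale recovery is that, for an upright object standing on flat ground, a known camera height plus the horizon row fix the object's metric height. A sketch under a zero-pitch pinhole assumption:

```python
def object_height(v_top, v_bottom, v_horizon, cam_height):
    """Absolute object height from a single view (level pinhole camera).

    For an upright object standing on flat ground, similar triangles give
    H = cam_height * (v_bottom - v_top) / (v_bottom - v_horizon).
    """
    assert v_bottom > v_horizon, "object base must lie below the horizon"
    return cam_height * (v_bottom - v_top) / (v_bottom - v_horizon)

# e.g. camera 1.65 m high, horizon at row 173, object spanning rows 290-350:
# object_height(290, 350, 173, 1.65) ~ 0.56 m
```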
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.