DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and
Depth from Monocular Videos
- URL: http://arxiv.org/abs/2403.05895v1
- Date: Sat, 9 Mar 2024 12:22:46 GMT
- Title: DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and
Depth from Monocular Videos
- Authors: Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan,
Xiaojuan Qi
- Abstract summary: We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
- Score: 76.01906393673897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although considerable advances have been made in self-supervised
depth estimation from monocular videos, most existing methods treat all
objects in a video as static entities, which violates the dynamic nature of
real-world scenes and fails to model the geometry and motion of moving
objects. In this paper, we propose a self-supervised method to jointly
learn 3D motion and depth from monocular videos. Our system contains a depth
estimation module to predict depth, and a new decomposed object-wise 3D motion
(DO3D) estimation module to predict ego-motion and 3D object motion. Depth and
motion networks work collaboratively to faithfully model the geometry and
dynamics of real-world scenes, which, in turn, benefits both depth and 3D
motion estimation. Their predictions are further combined to synthesize a novel
video frame for self-supervised training. As a core component of our framework,
DO3D is a new motion disentanglement module that learns to predict camera
ego-motion and instance-aware 3D object motion separately. To alleviate the
difficulties in estimating non-rigid 3D object motions, they are decomposed to
object-wise 6-DoF global transformations and a pixel-wise local 3D motion
deformation field. Qualitative and quantitative experiments are conducted on
three benchmark datasets, including KITTI, Cityscapes, and VKITTI2, where our
model delivers superior performance in all evaluated settings. For the depth
estimation task, our model outperforms all compared methods in the
high-resolution setting, attaining an absolute relative depth error (abs rel)
of 0.099 on the KITTI benchmark. In addition, our optical flow estimation results
(an overall EPE of 7.09 on KITTI) also surpass state-of-the-art methods and
largely improve the estimation of dynamic regions, demonstrating the
effectiveness of our motion model. Our code will be available.
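The abstract describes the motion decomposition and the view-synthesis supervision only at a high level. As a rough illustration, the hypothetical PyTorch sketch below shows one way such a pipeline could be wired together: predicted depth is back-projected to 3D; camera ego-motion, per-object 6-DoF transforms, and a pixel-wise residual deformation field move the points; and reprojection warps the source frame for an L1 photometric loss. This is a minimal sketch under stated assumptions, not the authors' implementation; the function names, the soft-mask blending of object transforms, and the composition order of ego- and object motion are all assumptions.

```python
# Hypothetical sketch (not the paper's released code) of decomposed-motion
# view synthesis for self-supervised training.
import torch
import torch.nn.functional as F


def backproject(depth, K_inv):
    """Lift a depth map (B, 1, H, W) to camera-space points (B, 3, H, W)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    rays = K_inv @ pix                                   # (B, 3, H*W)
    return (rays * depth.reshape(B, 1, -1)).reshape(B, 3, H, W)


def apply_decomposed_motion(points, T_ego, T_obj, masks, residual):
    """Move 3D points with ego-motion, per-object 6-DoF transforms, and a
    pixel-wise residual deformation field (the non-rigid part).

    points:   (B, 3, H, W) camera-space points of the target frame
    T_ego:    (B, 4, 4)    camera ego-motion (target -> source)
    T_obj:    (B, N, 4, 4) per-instance rigid transforms (assumed composition)
    masks:    (B, N, H, W) soft instance masks
    residual: (B, 3, H, W) pixel-wise 3D deformation field
    """
    B, _, H, W = points.shape
    homo = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)
    homo = homo.reshape(B, 4, -1)                        # (B, 4, H*W)

    moved = (T_ego @ homo)[:, :3]                        # static background
    for n in range(T_obj.shape[1]):                      # dynamic objects
        obj = (T_obj[:, n] @ T_ego @ homo)[:, :3]
        m = masks[:, n].reshape(B, 1, -1)
        moved = m * obj + (1.0 - m) * moved
    # Residual applied densely here; it could instead be restricted to
    # object regions.
    return moved.reshape(B, 3, H, W) + residual


def synthesize_and_compare(target, source, depth, K, K_inv,
                           T_ego, T_obj, masks, residual):
    """Warp `source` into the target view and return an L1 photometric loss."""
    B, _, H, W = target.shape
    pts = backproject(depth, K_inv)
    moved = apply_decomposed_motion(pts, T_ego, T_obj, masks, residual)

    proj = K @ moved.reshape(B, 3, -1)                   # (B, 3, H*W)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0                   # normalize to [-1, 1]
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)

    warped = F.grid_sample(source, grid, align_corners=True,
                           padding_mode="border")
    return (warped - target).abs().mean()
```

Self-supervised depth pipelines of this kind typically also use an SSIM term, edge-aware depth smoothness, and auto-masking of stationary pixels; those are omitted here for brevity.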
Related papers
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- 3D Object Aided Self-Supervised Monocular Depth Estimation [5.579605877061333]
We propose a new method to address dynamic object movements through monocular 3D object detection.
Specifically, we first detect 3D objects in the images and build the per-pixel correspondence of the dynamic pixels with the detected object pose.
In this way, the depth of every pixel can be learned via a meaningful geometry model.
arXiv Detail & Related papers (2022-12-04T08:52:33Z)
- Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation [76.58256020932312]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task.
We present a self-supervised learning framework for 3D object motion field estimation from monocular videos.
arXiv Detail & Related papers (2021-10-13T16:45:01Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method remarkably improves the detection performance of the state-of-the-art monocular-based method without extra data by 2.80% on the moderate test setting.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
- Kinematics-Guided Reinforcement Learning for Object-Aware 3D Ego-Pose Estimation [25.03715978502528]
We propose a method for incorporating object interaction and human body dynamics into the task of 3D ego-pose estimation.
We use a kinematics model of the human body to represent the entire range of human motion, and a dynamics model of the body to interact with objects inside a physics simulator.
This is the first work to estimate a physically valid 3D full-body interaction sequence with objects from egocentric videos.
arXiv Detail & Related papers (2020-11-10T00:06:43Z)
- Kinematic 3D Object Detection in Monocular Video [123.7119180923524]
We propose a novel method for monocular video-based 3D object detection which carefully leverages kinematic motion to improve precision of 3D localization.
We achieve state-of-the-art performance on monocular 3D object detection and the Bird's Eye View tasks within the KITTI self-driving dataset.
arXiv Detail & Related papers (2020-07-19T01:15:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.