Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
- URL: http://arxiv.org/abs/2102.02629v1
- Date: Thu, 4 Feb 2021 14:26:42 GMT
- Title: Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
- Authors: Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
- Abstract summary: We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
- Score: 114.02182755620784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an end-to-end joint training framework that explicitly models
6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular
camera setup without supervision. Our technical contributions are three-fold.
First, we highlight the fundamental difference between inverse and forward
projection while modeling the individual motion of each rigid object, and
propose a geometrically correct projection pipeline using a neural forward
projection module. Second, we design a unified instance-aware photometric and
geometric consistency loss that holistically imposes self-supervisory signals
for every background and object region. Lastly, we introduce a general-purpose
auto-annotation scheme using any off-the-shelf instance segmentation and
optical flow models to produce video instance segmentation maps that will be
utilized as input to our training pipeline. These proposed elements are
validated in a detailed ablation study. Through extensive experiments conducted
on the KITTI and Cityscapes datasets, our framework is shown to outperform the
state-of-the-art depth and motion estimation methods. Our code, dataset, and
models are available at https://github.com/SeokjuLee/Insta-DM.
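To make the inverse-versus-forward distinction concrete, here is a minimal PyTorch-style sketch under our own assumptions (the function names, the tensor shapes of K and K_inv as (B, 3, 3), transforms as (B, 4, 4), depths as (B, 1, H, W), and the naive rounding z-buffer are all ours, not the paper's code). Inverse warping attaches the motion to each target pixel and samples the source image there, while forward projection pushes each source pixel to where its own motion sends it; the paper's neural forward projection module is a differentiable counterpart of the non-differentiable splat below.

```python
import torch
import torch.nn.functional as F

def pixel_grid(B, H, W, device):
    """Homogeneous pixel coordinates, shape (B, 3, H*W)."""
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()
    return pix.view(1, 3, -1).expand(B, -1, -1)

def inverse_warp(img_src, depth_tgt, K, K_inv, T_tgt_to_src):
    """Inverse (backward) warping: every TARGET pixel samples the source
    image at its reprojected location. Differentiable via grid_sample, but
    the motion is attached to the target pixel, which the paper argues is
    geometrically incorrect for independently moving objects."""
    B, _, H, W = img_src.shape
    pix = pixel_grid(B, H, W, img_src.device)
    cam = K_inv @ pix * depth_tgt.view(B, 1, -1)             # back-project
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], 1)   # homogeneous
    p = K @ (T_tgt_to_src @ cam)[:, :3]                      # into source view
    uv = p[:, :2] / p[:, 2:].clamp(min=1e-6)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,          # normalize to [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], -1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, align_corners=True)

def forward_splat(img_src, depth_src, K, K_inv, T_src_to_tgt):
    """Forward projection: every SOURCE pixel is pushed to where its own
    rigid motion sends it, with a z-buffer resolving occlusions. Rounding
    to integer pixels makes this sketch non-differentiable, unlike the
    paper's neural forward projection module."""
    B, C, H, W = img_src.shape
    pix = pixel_grid(B, H, W, img_src.device)
    cam = K_inv @ pix * depth_src.view(B, 1, -1)
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], 1)
    p = K @ (T_src_to_tgt @ cam)[:, :3]
    z = p[:, 2].clamp(min=1e-6)
    u = (p[:, 0] / z).round().long().clamp(0, W - 1)
    v = (p[:, 1] / z).round().long().clamp(0, H - 1)
    idx = v * W + u                                          # (B, H*W)
    zmin = torch.full_like(z, float("inf"))
    zmin.scatter_reduce_(1, idx, z, reduce="amin")           # z-buffer
    win = z <= zmin.gather(1, idx) + 1e-6                    # nearest surface wins
    out = torch.zeros_like(img_src).view(B, C, -1)
    for b in range(B):                                       # ties resolved arbitrarily
        out[b][:, idx[b][win[b]]] = img_src[b].view(C, -1)[:, win[b]]
    return out.view(B, C, H, W)
```

In the paper's instance-aware setting, each object region carries its own 6-DoF transform, so a splat like this would be applied per instance mask and the results composited with the ego-motion-warped background.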
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
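For intuition about the pointmap representation mentioned above, here is a small illustrative sketch (our own, not MonST3R code): a pointmap is simply an (H, W, 3) grid of 3D points, one per pixel. MonST3R regresses one such map per timestep directly from images; the sketch below merely derives the equivalent map from a known depth map and inverse intrinsics to show the data structure.

```python
import torch

def depth_to_pointmap(depth, K_inv):
    """Illustrative only: build an (H, W, 3) pointmap, one 3D point per
    pixel, from a depth map and 3x3 inverse intrinsics. MonST3R predicts
    such per-timestep pointmaps directly rather than from given depth."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], -1).float()  # (H, W, 3)
    return (pix @ K_inv.T) * depth.unsqueeze(-1)                  # X = z * K^-1 p
```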
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry [7.067145619709089]
We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles'.
For all datasets, our method outperforms state-of-the-art methods, in particular on the depth prediction task.
arXiv Detail & Related papers (2024-06-16T17:24:20Z)
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
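As a hedged sketch of the decomposition idea (our own notation and helper, not DO3D's actual parameterization), the motion of each 3D point can be split into a global ego-motion term plus a per-object rigid residual:

```python
import torch

def decomposed_motion(points, T_ego, obj_masks, T_objs):
    """Hypothetical sketch: move every 3D point by the camera-induced
    transform T_ego; points inside an object mask additionally move by
    that object's own rigid transform T_objs[k].
    points: (N, 3); T_ego, T_objs[k]: (4, 4); obj_masks: (K, N) bool."""
    pts_h = torch.cat([points, torch.ones(len(points), 1)], dim=1)  # (N, 4)
    moved = pts_h @ T_ego.T                                  # background term
    for mask, T_obj in zip(obj_masks, T_objs):
        moved[mask] = pts_h[mask] @ (T_ego @ T_obj).T        # object term on top
    return moved[:, :3]
```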
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- UniQuadric: A SLAM Backend for Unknown Rigid Object 3D Tracking and Light-Weight Modeling [7.626461564400769]
We propose a novel SLAM backend that unifies ego-motion tracking, rigid object motion tracking, and modeling.
Our system showcases the potential application of object perception in complex dynamic scenes.
arXiv Detail & Related papers (2023-09-29T07:50:09Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-DepthFormer framework, which predicts scene depth and the 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representations.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic priors.
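A minimal sketch of the multi-view attention idea (our simplification, not Dyna-DepthFormer's exact block): target-frame depth features attend to features from another view so that geometrically corresponding regions can exchange information.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Sketch only: cross-attention from target-frame features (queries) to
    source-frame features (keys/values) with a transformer-style residual.
    Layer sizes and the single-block structure are our assumptions."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tgt_feat, src_feat):
        # tgt_feat, src_feat: (B, H*W, dim) flattened feature maps
        out, _ = self.attn(tgt_feat, src_feat, src_feat)
        return self.norm(tgt_feat + out)
```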
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatial-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation [76.58256020932312]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task.
We present a self-supervised learning framework for 3D object motion field estimation from monocular videos.
arXiv Detail & Related papers (2021-10-13T16:45:01Z)
- DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation [53.55300278592281]
We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image.
Our framework makes inferences based on the rich geometric information of the object in the depth channel alone.
Our framework competes with state-of-the-art approaches that require labeled real-world images.
arXiv Detail & Related papers (2021-06-27T10:41:50Z)