RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic
Scenes
- URL: http://arxiv.org/abs/2303.04456v1
- Date: Wed, 8 Mar 2023 09:11:50 GMT
- Title: RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic
Scenes
- Authors: Tak-Wai Hui
- Abstract summary: An unsupervised learning framework is proposed to jointly predict monocular depth and complete 3D motion.
Recurrent modulation units are used to adaptively and iteratively fuse encoder and decoder features.
A warping-based network is used to estimate a motion field of moving objects without using semantic priors.
- Score: 7.81768535871051
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Unsupervised methods have shown promising results on monocular depth
estimation. However, the training data must be captured in scenes without
moving objects. To push the envelope of accuracy, recent methods tend to
increase their model parameters. In this paper, an unsupervised learning
framework is proposed to jointly predict monocular depth and complete 3D motion
including the motions of moving objects and the camera. (1) Recurrent modulation
units are used to adaptively and iteratively fuse encoder and decoder features.
This not only improves single-image depth inference but also avoids inflating
the model's parameter count. (2) Instead of using a single set of filters for
upsampling, multiple sets of filters are devised for the residual upsampling.
This facilitates the learning of edge-preserving filters and leads to improved
performance. (3) A warping-based network is used to estimate a motion field of
moving objects without using semantic priors. This removes the scene-rigidity
requirement and allows general videos to be used for unsupervised learning. The
motion field is further regularized by an
outlier-aware training loss. Although the depth model uses only a single image
at test time and just 2.97M parameters, it achieves state-of-the-art results on
the KITTI and Cityscapes benchmarks.
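To make point (1) above concrete, here is a minimal PyTorch sketch of the recurrent-modulation idea: a small gated unit, applied repeatedly with shared weights, that fuses an encoder skip feature with the current decoder feature. The class name, layer sizes, and the GRU-style gating are illustrative assumptions, not the paper's exact design.

```python
# A hedged sketch of a recurrent modulation unit: at each refinement
# iteration, the decoder feature modulates the encoder skip feature
# through learned gates. All names and sizes here are assumptions.
import torch
import torch.nn as nn

class RecurrentModulationUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Both the gate and the candidate feature are predicted from the
        # concatenated encoder and decoder features.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.Sigmoid()
        )
        self.candidate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.Tanh()
        )

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([enc_feat, dec_feat], dim=1)
        g = self.gate(x)       # how strongly to update each channel/pixel
        c = self.candidate(x)  # candidate modulated feature
        # GRU-like blend: keep the encoder feature where the gate is low.
        return (1.0 - g) * enc_feat + g * c
```

Because the same unit is reused across iterations, the fused features can be refined repeatedly without adding parameters per iteration, which is consistent with the abstract's claim that the recurrent design does not inflate the model size.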
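Point (3) depends on warping a source frame into the target view using the predicted depth, the camera motion, and a per-pixel 3D motion field for moving objects. The sketch below shows that warping under assumed tensor shapes; all function and variable names are hypothetical, and the paper's outlier-aware loss is not reproduced here.

```python
# A hedged sketch of depth-plus-motion-field warping: back-project pixels
# with the predicted depth, apply the rigid camera motion, add the per-pixel
# 3D motion of moving objects, then re-project and sample the source frame.
import torch
import torch.nn.functional as F

def warp_with_motion(src, depth, K, K_inv, T, motion_field):
    """src: (B,3,H,W) source image; depth: (B,1,H,W); K, K_inv: (B,3,3)
    intrinsics; T: (B,4,4) relative camera pose; motion_field: (B,3,H,W)."""
    B, _, H, W = depth.shape
    # Homogeneous pixel grid (u, v, 1).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(depth.device)   # (B,3,HW)
    # Back-project to camera space and apply the rigid camera motion...
    cam = depth.view(B, 1, -1) * (K_inv @ pix)                    # (B,3,HW)
    cam = T[:, :3, :3] @ cam + T[:, :3, 3:]
    # ...then add the 3D motion of moving objects at each pixel.
    cam = cam + motion_field.view(B, 3, -1)
    # Re-project to the image plane and bilinearly sample the source frame.
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                # (B,2,HW)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0                            # to [-1, 1]
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)
```

The photometric difference between the warped source and the target frame then drives training; an outlier-aware loss such as the one the abstract mentions would downweight pixels where this residual stays large (e.g., at occlusions), though its exact form is defined in the paper.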
Related papers
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos [52.726585508669686]
We propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence.
We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively.
By combining camera information, uncertainty, and depth, our model can produce high-quality 4D point clouds.
arXiv Detail & Related papers (2025-03-30T02:22:11Z)
- A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation [46.037640130193566]
We propose a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU.
Our experiments show improvements over zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches, and better robustness than depth completion approaches.
arXiv Detail & Related papers (2024-12-18T17:50:15Z)
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image [85.91935485902708]
We show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models.
We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models.
Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
arXiv Detail & Related papers (2023-07-20T16:14:23Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when trained on monocular videos of highly dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allow us to recover 3D scene shapes.
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
arXiv Detail & Related papers (2022-08-28T16:20:14Z)
- Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
- RAUM-VO: Rotational Adjusted Unsupervised Monocular Visual Odometry [0.0]
We present RAUM-VO, an approach based on a model-free epipolar constraint for frame-to-frame motion estimation.
RAUM-VO shows a considerable accuracy improvement compared to other unsupervised pose networks on the KITTI dataset.
arXiv Detail & Related papers (2022-03-14T15:03:24Z)
- Learning to Recover 3D Scene Shape from a Single Image [98.20106822614392]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then use 3D point cloud encoders to predict the missing depth shift and focal length that allow us to recover a realistic 3D scene shape.
arXiv Detail & Related papers (2020-12-17T02:35:13Z)
- SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving [37.50089104051591]
State-of-the-art self-supervised learning approaches for monocular depth estimation usually suffer from scale ambiguity.
This paper introduces a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhole camera images.
arXiv Detail & Related papers (2020-08-10T10:52:47Z)