Related papers: MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications

URL: http://arxiv.org/abs/2411.19717v1
Date: Fri, 29 Nov 2024 14:06:58 GMT
Title: MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications
Authors: Gasser Elazab, Torben Gräber, Michael Unterreiner, Olaf Hellwich
Abstract summary: We introduce a novel self-supervised metric-scaled MDE model that requires only monocular video data and the camera's mounting position. Our method achieved state-of-the-art results for the driving benchmark KITTI for metric-scaled depth prediction. Notably, it is one of the first methods to produce self-supervised metric-scaled depth prediction for the challenging Cityscapes dataset.
Score: 2.5249064981269287
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-supervised monocular depth estimation (MDE) has gained popularity for obtaining depth predictions directly from videos. However, these methods often produce scale-invariant results unless additional training signals are provided. Addressing this challenge, we introduce a novel self-supervised metric-scaled MDE model that requires only monocular video data and the camera's mounting position, both of which are readily available in modern vehicles. Our approach leverages planar-parallax geometry to reconstruct scene structure. The full pipeline consists of three main networks: a multi-frame network, a single-frame network, and a pose network. The multi-frame network processes sequential frames to estimate the structure of the static scene using planar-parallax geometry and the camera mounting position. Based on this reconstruction, it acts as a teacher, distilling knowledge such as scale information, the masked drivable area, metric-scale depth for the static scene, and a dynamic object mask to the single-frame network. It also aids the pose network in predicting a metric-scaled relative pose between two subsequent images. Our method achieved state-of-the-art results on the KITTI driving benchmark for metric-scaled depth prediction. Notably, it is one of the first methods to produce self-supervised metric-scaled depth prediction for the challenging Cityscapes dataset, demonstrating its effectiveness and versatility.
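The abstract states that the camera's mounting position is what makes metric scale recoverable. The sketch below illustrates only that general idea and is not the MonoPP pipeline (which relies on planar-parallax geometry and three networks): given up-to-scale 3D points on the road surface, a plane is fitted and the ratio between the known camera height and the estimated one gives a scale factor. The function names, the camera-centred coordinate convention, and the 1.65 m height are assumptions chosen for illustration.

```python
import numpy as np


def fit_plane(points):
    """Least-squares plane fit through an (N, 3) point set.

    Returns a unit normal n and offset d such that n @ x + d = 0
    for points x on the plane.
    """
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the plane normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    d = -normal @ centroid
    return normal, d


def metric_scale_from_camera_height(ground_points, camera_height_m):
    """Scale factor that maps up-to-scale points to metres.

    ground_points: (N, 3) road-surface points in camera coordinates
                   (camera at the origin), in arbitrary units.
    camera_height_m: known mounting height of the camera above the road.
    """
    _, d = fit_plane(ground_points)
    estimated_height = abs(d)  # distance from the camera origin to the plane
    return camera_height_m / estimated_height


# Hypothetical example: a flat road seen by a camera mounted 1.65 m above it,
# reconstructed at one tenth of the true scale.
rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=200)
z = rng.uniform(0.5, 5.0, size=200)
y = np.full(200, 0.165)  # y axis points down towards the road
road_points = np.stack([x, y, z], axis=1)

scale = metric_scale_from_camera_height(road_points, camera_height_m=1.65)
metric_points = scale * road_points
```

Multiplying an up-to-scale depth map by the same factor would put it in metres, provided the selected points really lie on the road, which is one reason a drivable-area mask is useful.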

Related papers

Review of Feed-forward 3D Reconstruction: From DUSt3R to VGGT [10.984522161856955]
3D reconstruction is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. Deep learning has catalyzed a paradigm shift in 3D reconstruction. New models employ a unified deep network to jointly infer camera poses and dense geometry directly from an unconstrained set of images in a single forward pass.
arXiv Detail & Related papers (2025-07-11T09:41:54Z)
Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes [56.936178608296906]
We present a new model, coined MMP, to estimate the geometry in a feed-forward manner. Based on the recent Siamese architecture, we introduce a new trajectory encoding module. We find MMP can achieve state-of-the-art quality in feed-forward pointmap prediction.
arXiv Detail & Related papers (2025-05-03T08:28:15Z)
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes. We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
DoubleTake: Geometry Guided Depth Estimation [17.464549832122714]
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task. We introduce a reconstruction approach that combines volume features with a hint of the prior geometry, rendered as a depth map from the current camera location. We demonstrate that our method can run at interactive speeds and achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
arXiv Detail & Related papers (2024-06-26T14:29:05Z)
OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose an OccNeRF method for training occupancy networks without 3D supervision. We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range. For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z)
FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction. Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
FSNet: Redesign Self-Supervised MonoDepth for Full-Scale Depth Prediction for Autonomous Driving [18.02943016671203]
This study proposes a comprehensive self-supervised framework for accurate scale-aware depth prediction on autonomous driving scenes. In particular, we introduce a Full-Scale depth prediction network named FSNet. With FSNet, robots and vehicles with only one well-calibrated camera can collect sequences of training image frames and camera poses, and infer accurate 3D depths of the environment without extra labeling work or 3D data.
arXiv Detail & Related papers (2023-04-21T03:17:04Z)
Instance-aware multi-object self-supervision for monocular depth prediction [0.0]
This paper proposes a self-supervised monocular image-to-depth prediction framework that is trained with an end-to-end photometric loss. Self-supervision is performed by warping the images across a video sequence using depth and scene motion including object instances.
arXiv Detail & Related papers (2022-03-02T00:59:25Z)
Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth [90.33296913575818]
In some video-based scenarios, such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift in per-frame predictions may cause depth inconsistency. We propose a locally weighted linear regression method to recover the scale and shift with very sparse anchor points. Our method can boost the performance of existing state-of-the-art approaches by up to 50% over several zero-shot benchmarks.
arXiv Detail & Related papers (2022-02-03T08:52:54Z)
TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo [55.30992853477754]
We present TANDEM, a real-time monocular tracking and dense mapping framework. For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of alignments. TANDEM shows state-of-the-art real-time 3D reconstruction performance.
arXiv Detail & Related papers (2021-11-14T19:01:02Z)
Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D object detection using only RGB images, called KM3D-Net. We design a fully convolutional model to predict object keypoints, dimension, and orientation, and then combine these estimations with perspective geometry constraints to compute the position attribute.
arXiv Detail & Related papers (2020-09-02T00:51:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.