SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model
- URL: http://arxiv.org/abs/2403.08556v1
- Date: Wed, 13 Mar 2024 14:08:25 GMT
- Title: SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model
- Authors: Yihao Liu and Feng Xue and Anlong Ming
- Abstract summary: This paper proposes SM4Depth, a seamless MMDE method to address all the issues above within a single network.
First, we reveal that a consistent field of view (FOV) is the key to resolving "metric ambiguity" across cameras.
Second, to achieve consistently high accuracy across scenes, we explicitly model the metric scale determination as discretizing the depth interval into bins.
Third, to reduce the reliance on massive training data, we propose a "divide and conquer" solution.
- Score: 23.95095404136943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The generalization of monocular metric depth estimation (MMDE) has been a
longstanding challenge. Recent methods made progress by combining relative and
metric depth or aligning input image focal length. However, they are still
beset by challenges in camera, scene, and data levels: (1) Sensitivity to
different cameras; (2) Inconsistent accuracy across scenes; (3) Reliance on
massive training data. This paper proposes SM4Depth, a seamless MMDE method, to
address all the issues above within a single network. First, we reveal that a
consistent field of view (FOV) is the key to resolving "metric ambiguity"
across cameras, which guides us to propose a more straightforward preprocessing
unit. Second, to achieve consistently high accuracy across scenes, we
explicitly model the metric scale determination as discretizing the depth
interval into bins and propose variation-based unnormalized depth bins. This
method bridges the depth gap of diverse scenes by reducing the ambiguity of the
conventional metric bin. Third, to reduce the reliance on massive training
data, we propose a "divide and conquer" solution. Instead of estimating
directly from the vast solution space, the correct metric bins are estimated
from multiple solution sub-spaces for complexity reduction. Finally, with just
150K RGB-D pairs and a consumer-grade GPU for training, SM4Depth achieves
state-of-the-art performance on most previously unseen datasets, especially
surpassing ZoeDepth and Metric3D on mRI$_\theta$. The code can be found at
https://github.com/1hao-Liu/SM4Depth.
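The abstract names three mechanisms: FOV-consistent preprocessing, bin-based metric scale determination, and a "divide and conquer" search over metric-bin sub-spaces. The two sketches below illustrate the first two ideas only; they are not SM4Depth's implementation, and every function name, default value, and shape convention is an assumption made for illustration.

A minimal FOV-alignment sketch. The only grounded fact here is the geometry FOV = 2·arctan(W / 2f): once images from different cameras are cropped or padded to a common FOV, focal length no longer confounds the metric scale.

```python
import numpy as np

def align_fov(image: np.ndarray, focal_px: float,
              target_fov_deg: float = 60.0) -> np.ndarray:
    """Crop (or pad) an image so its horizontal FOV matches a canonical value.

    Hypothetical sketch: the paper's actual preprocessing unit is not
    described in the abstract. This only illustrates that
    FOV = 2 * arctan(W / (2 * f)) depends on width and focal length,
    so fixing the FOV across cameras removes one source of metric ambiguity.
    """
    h, w = image.shape[:2]
    # Width that would yield the target FOV at this focal length.
    target_w = int(round(2.0 * focal_px * np.tan(np.radians(target_fov_deg) / 2.0)))
    if target_w <= w:
        # The camera sees more than the canonical FOV: center-crop.
        x0 = (w - target_w) // 2
        return image[:, x0:x0 + target_w]
    # The camera sees less: pad symmetrically (a real system might resize instead).
    pad = (target_w - w) // 2
    pad_width = ((0, 0), (pad, target_w - w - pad)) + ((0, 0),) * (image.ndim - 2)
    return np.pad(image, pad_width)
```

For the bin-based scale determination, the decoding step below follows the well-known AdaBins recipe (per-image bin widths, per-pixel softmax over bin centers) as background. How SM4Depth's variation-based unnormalized bins set the depth interval per scene, and how the "divide and conquer" step picks among sub-spaces, is not specified in the abstract, so a fixed [d_min, d_max] range stands in here.

```python
import torch

def decode_depth_from_bins(bin_widths: torch.Tensor, probs: torch.Tensor,
                           d_min: float = 0.1, d_max: float = 80.0) -> torch.Tensor:
    """AdaBins-style depth decoding, shown only as background for bin-based MMDE.

    bin_widths: (B, N) non-negative widths; probs: (B, N, H, W) per-pixel
    softmax over bins. The fixed [d_min, d_max] range is an assumption,
    not SM4Depth's variation-based unnormalized bins.
    """
    widths = bin_widths / bin_widths.sum(dim=1, keepdim=True)   # normalize widths
    edges = d_min + (d_max - d_min) * torch.cumsum(widths, dim=1)
    edges = torch.cat([torch.full_like(edges[:, :1], d_min), edges], dim=1)
    centers = 0.5 * (edges[:, :-1] + edges[:, 1:])              # (B, N) bin centers
    return (probs * centers[:, :, None, None]).sum(dim=1)       # (B, H, W) depth map
```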
Related papers
- ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module.
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z)
- UniDepth: Universal Monocular Metric Depth Estimation [81.80512457953903]
We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains.
Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations.
Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth.
arXiv Detail & Related papers (2024-03-27T18:06:31Z)
- Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model [34.85279074665031]
Methods for monocular depth estimation have made significant strides on standard benchmarks, but zero-shot metric depth estimation remains unsolved.
Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes.
We advocate a generic, task-agnostic diffusion model, with several advancements such as log-scale depth parameterization.
arXiv Detail & Related papers (2023-12-20T18:27:47Z)
- NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also present a large-scale Video Depth in the Wild dataset containing 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
- RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo [21.209964556493368]
RayMVSNet learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth.
RayMVSNet++ achieves state-of-the-art performance on the ScanNet dataset.
arXiv Detail & Related papers (2023-07-16T02:10:47Z)
- Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allow us to recover 3D scene shapes.
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
arXiv Detail & Related papers (2022-08-28T16:20:14Z)
- DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data [110.29043712400912]
We present a method for depth estimation from monocular images that predicts high-quality depth on diverse scenes up to an affine transformation (see the alignment sketch after this list).
Experiments show that our method outperforms previous methods on 8 datasets by a large margin under the zero-shot test setting.
arXiv Detail & Related papers (2020-02-03T05:38:33Z)
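Several entries above (the two-stage scene-shape framework, DiverseDepth) predict depth only up to a scale, or a scale and shift, which must then be recovered against metric ground truth. A minimal least-squares alignment sketch, a standard recovery step rather than any of these papers' code:

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray,
                      mask: np.ndarray) -> np.ndarray:
    """Least-squares fit of scale s and shift t so that s * pred + t ≈ gt.

    Standard practice for evaluating affine-invariant depth; the function
    name and interface are assumptions, not taken from the papers above.
    """
    p, g = pred[mask], gt[mask]                        # valid-pixel values only
    A = np.stack([p, np.ones_like(p)], axis=1)         # (M, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)     # solve for [scale, shift]
    return s * pred + t
```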