Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image
- URL: http://arxiv.org/abs/2307.10984v1
- Date: Thu, 20 Jul 2023 16:14:23 GMT
- Title: Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image
- Authors: Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang,
Xiaozhi Chen, Chunhua Shen
- Abstract summary: We show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models.
We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models.
Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
- Score: 85.91935485902708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing accurate 3D scenes from images is a long-standing vision task.
Due to the ill-posedness of the single-image reconstruction problem, most
well-established methods are built upon multi-view geometry. State-of-the-art
(SOTA) monocular metric depth estimation methods can only handle a single
camera model and are unable to perform mixed-data training due to the metric
ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets
achieve zero-shot generalization by learning affine-invariant depths, which
cannot recover real-world metrics. In this work, we show that the key to a
zero-shot single-view metric depth model lies in the combination of large-scale
data training and resolving the metric ambiguity from various camera models. We
propose a canonical camera space transformation module, which explicitly
addresses the ambiguity problems and can be effortlessly plugged into existing
monocular models. Equipped with our module, monocular models can be stably
trained with over 8 million images with thousands of camera models, resulting
in zero-shot generalization to in-the-wild images with unseen camera settings.
Experiments demonstrate SOTA performance of our method on 7 zero-shot
benchmarks. Notably, our method won the championship in the 2nd Monocular Depth
Estimation Challenge. Our method enables the accurate recovery of metric 3D
structures on randomly collected internet images, paving the way for plausible
single-image metrology. The potential benefits extend to downstream tasks,
which can be significantly improved by simply plugging in our model. For
example, our model relieves the scale drift issues of monocular-SLAM (Fig. 1),
leading to high-quality metric scale dense mapping. The code is available at
https://github.com/YvanYin/Metric3D.
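As a rough, hedged illustration of the canonical camera space idea described in the abstract (this sketch assumes a pinhole model, an arbitrary canonical focal length, and the label-scaling variant; the names are illustrative and not taken from the released code):

```python
import numpy as np

CANONICAL_FOCAL = 1000.0  # assumed canonical focal length in pixels (illustrative)

def depth_to_canonical(metric_depth: np.ndarray, focal: float) -> np.ndarray:
    """Rescale ground-truth metric depth into a canonical camera space.

    Under a pinhole model, multiplying depth by f_canonical / f is equivalent
    to pretending the image was captured by the canonical camera, so labels
    from cameras with different focal lengths become mutually consistent and
    can be mixed in one large training set.
    """
    return metric_depth * (CANONICAL_FOCAL / focal)

def depth_to_metric(canonical_depth: np.ndarray, focal: float) -> np.ndarray:
    """De-canonical transform: map a prediction back to the real camera's metric scale."""
    return canonical_depth * (focal / CANONICAL_FOCAL)

# Example: a prediction made in canonical space for an image whose true focal
# length is 720 px is mapped back to metric depth for that camera.
pred_canonical = np.full((480, 640), 5.0, dtype=np.float32)
pred_metric = depth_to_metric(pred_canonical, focal=720.0)
```

The module in the paper also covers an image-resizing variant and the associated label and intrinsics bookkeeping; the point of this sketch is only the focal-length ratio that removes the metric ambiguity across camera models.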
Related papers
- Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting [32.66151412557986]
We present a weak-to-strong eliciting framework aimed at enhancing surround refinement while maintaining robust monocular perception.
Our framework employs weakly tuned experts trained on distinct subsets, and each is inherently biased toward specific camera configurations and scenarios.
For MC3D-Det joint training, an elaborate dataset merging strategy is designed to handle inconsistent camera numbers and camera parameters.
arXiv Detail & Related papers (2024-04-10T03:11:10Z)
- UniDepth: Universal Monocular Metric Depth Estimation [81.80512457953903]
We propose a new model, UniDepth, capable of reconstructing metric 3D scenes solely from single images across domains.
Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations.
Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth.
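A loose sketch of what a camera-disentangled, pseudo-spherical encoding can look like (an assumption-laden illustration, not UniDepth's actual implementation): per-pixel ray angles are computed from the intrinsics, while a log radial distance carries the scene geometry.

```python
import numpy as np

def pseudo_spherical_encoding(depth, fx, fy, cx, cy):
    """Encode a pinhole depth map as (azimuth, elevation, log radial distance).

    The ray angles depend only on the camera intrinsics; the log distance
    carries the scene geometry. Keeping the two separate loosely illustrates
    how camera and depth representations can be disentangled.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx                      # normalized ray directions
    y = (v - cy) / fy
    norm = np.sqrt(x ** 2 + y ** 2 + 1.0)  # length of the ray (x, y, 1)
    azimuth = np.arctan2(x, 1.0)
    elevation = np.arcsin(y / norm)
    radial = depth * norm                  # Euclidean distance along the ray
    return azimuth, elevation, np.log(np.clip(radial, 1e-6, None))
```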
arXiv Detail & Related papers (2024-03-27T18:06:31Z)
- Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation [74.28509379811084]
Metric3D v2 is a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image.
We propose solutions for both metric depth estimation and surface normal estimation.
Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
arXiv Detail & Related papers (2024-03-22T02:30:46Z)
- DUSt3R: Geometric 3D Vision Made Easy [8.471330244002564]
We introduce DUSt3R, a novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections.
We show that this formulation smoothly unifies the monocular and binocular reconstruction cases.
Our formulation directly provides a 3D model of the scene as well as depth information; interestingly, pixel matches and relative and absolute camera poses can be seamlessly recovered from it.
arXiv Detail & Related papers (2023-12-21T18:52:14Z)
- Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allow us to recover 3D scene shapes.
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
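A minimal sketch of that recovery step, under a simple pinhole assumption (the names, the sign of the shift correction, and the principal-point default are illustrative, not the authors' code): the shift-corrected depth is unprojected with the estimated focal length into a point cloud whose shape is correct up to a global scale.

```python
import numpy as np

def recover_scene_shape(affine_depth, shift, focal, cx=None, cy=None):
    """Unproject shift-corrected depth into a 3D point cloud (up to scale).

    affine_depth : depth predicted up to an unknown scale and shift
    shift        : estimated depth shift (sign convention is illustrative)
    focal        : estimated focal length in pixels
    """
    h, w = affine_depth.shape
    cx = w / 2.0 if cx is None else cx   # assume the principal point is the image center
    cy = h / 2.0 if cy is None else cy
    depth = affine_depth + shift         # removing the shift fixes the shape distortion
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    return np.stack([x, y, depth], axis=-1)   # (H, W, 3); overall scale remains unknown
```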
arXiv Detail & Related papers (2022-08-28T16:20:14Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.