MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources
- URL: http://arxiv.org/abs/2601.22054v1
- Date: Thu, 29 Jan 2026 17:52:41 GMT
- Title: MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources
- Authors: Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen,
- Abstract summary: Metric Anything is a simple and scalable pretraining framework for metric depth estimation. It learns metric depth from noisy, diverse 3D sources without manually engineered prompts. We show that Metric Anything can benefit from the same scaling laws that drive modern foundation models.
- Score: 25.21242040780486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
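The abstract describes the Sparse Metric Prompt only as the result of randomly masking depth maps. The sketch below illustrates that idea in plain Python under stated assumptions: the function name `sparse_metric_prompt`, the `keep_ratio` parameter, uniform per-pixel sampling, and the convention that a depth of 0 means "missing" are all hypothetical choices for illustration, not details taken from the paper.

```python
import random


def sparse_metric_prompt(depth, keep_ratio=0.01, seed=None):
    """Randomly mask a dense depth map (illustrative sketch, not the paper's code).

    depth: list of rows of metric depths in meters; values <= 0 count as missing.
    keep_ratio: assumed fraction of valid pixels to keep as sparse prompt points.
    Returns (prompt, mask): prompt holds kept depths and 0.0 elsewhere,
    mask is True exactly where a depth value was kept.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    rng = random.Random(seed)
    # Collect coordinates of pixels that carry a usable depth measurement.
    valid = [(r, c) for r, row in enumerate(depth)
             for c, d in enumerate(row) if d > 0]
    # Uniformly sample the subset of valid pixels that survives the masking.
    kept = set(rng.sample(valid, round(keep_ratio * len(valid))))
    prompt = [[d if (r, c) in kept else 0.0 for c, d in enumerate(row)]
              for r, row in enumerate(depth)]
    mask = [[(r, c) in kept for c in range(len(row))]
            for r, row in enumerate(depth)]
    return prompt, mask


# Usage: an 8x8 synthetic depth map with one missing pixel, keeping 25% of valid points.
dense = [[2.0 + 0.1 * (r + c) for c in range(8)] for r in range(8)]
dense[0][0] = 0.0  # simulate a sensor dropout
prompt, mask = sparse_metric_prompt(dense, keep_ratio=0.25, seed=0)
```

Because the kept values are copied unchanged, the prompt stays in metric units regardless of which sensor or camera produced the dense map, which is one plausible reading of how such a prompt could act as a source-agnostic interface.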
Related papers
- WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting [51.69408870574092]
We present WorldMirror, an all-in-one, feed-forward model for versatile 3D geometric prediction tasks. Our framework flexibly integrates diverse geometric priors, including camera poses, intrinsics, and depth maps. WorldMirror achieves state-of-the-art performance across diverse benchmarks from camera, point map, depth, and surface normal estimation to novel view synthesis.
arXiv Detail & Related papers (2025-10-12T17:59:09Z)
- E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z)
- Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation [41.98740330990215]
This work proposes a novel approach that bridges 2D vision foundation models with 3D tasks. We leverage the zero-shot capabilities of vision-language models for image semantics. We project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision.
arXiv Detail & Related papers (2025-03-10T09:54:40Z)
- UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler [62.06785782635153]
We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations.
arXiv Detail & Related papers (2025-02-27T14:03:15Z)
- Single-Shot Metric Depth from Focused Plenoptic Cameras [18.412662939667676]
Metric depth estimation from visual sensors is crucial for robots to perceive, navigate, and interact with their environment. Light field imaging provides a promising solution for estimating metric depth by using a unique lens configuration through a single device. Our work explores the potential of focused plenoptic cameras for dense metric depth.
arXiv Detail & Related papers (2024-12-03T11:21:17Z)
- MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation [9.639797094021988]
MetricGold is a novel approach that harnesses the rich priors of generative diffusion models to improve metric depth estimation. Our experiments demonstrate robust generalization across diverse datasets, producing sharper and higher-quality metric depth estimates.
arXiv Detail & Related papers (2024-11-16T20:59:01Z)
- UniDepth: Universal Monocular Metric Depth Estimation [81.80512457953903]
We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains.
Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations.
Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth.
arXiv Detail & Related papers (2024-03-27T18:06:31Z)
- Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation [74.28509379811084]
Metric3D v2 is a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image. We propose solutions for both metric depth estimation and surface normal estimation. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
arXiv Detail & Related papers (2024-03-22T02:30:46Z)
- Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image [85.91935485902708]
We show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models.
We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models.
Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
arXiv Detail & Related papers (2023-07-20T16:14:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.