GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion
- URL: http://arxiv.org/abs/2409.09896v1
- Date: Sun, 15 Sep 2024 23:32:04 GMT
- Title: GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion
- Authors: Vitor Guizilini, Pavel Tokmakov, Achal Dave, Rares Ambrus
- Abstract summary: We present GRIN, an efficient diffusion model designed to ingest sparse unstructured training data.
We show that GRIN establishes a new state of the art in zero-shot metric monocular depth estimation even when trained from scratch.
- Score: 27.35300492569507
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: 3D reconstruction from a single image is a long-standing problem in computer vision. Learning-based methods address its inherent scale ambiguity by leveraging increasingly large labeled and unlabeled datasets to produce geometric priors capable of generating accurate predictions across domains. As a result, state-of-the-art approaches show impressive performance in zero-shot relative and metric depth estimation. Recently, diffusion models have exhibited remarkable scalability and generalizable properties in their learned representations. However, because these models repurpose tools originally designed for image generation, they can only operate on dense ground truth, which is not available for most depth labels, especially in real-world settings. In this paper we present GRIN, an efficient diffusion model designed to ingest sparse unstructured training data. We use image features with 3D geometric positional encodings to condition the diffusion process both globally and locally, generating depth predictions at the pixel level. With comprehensive experiments across eight indoor and outdoor datasets, we show that GRIN establishes a new state of the art in zero-shot metric monocular depth estimation even when trained from scratch.
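To make the mechanism concrete, below is a minimal sketch (not the authors' code) of pixel-level diffusion with sparse supervision: a per-pixel denoiser conditioned on image features and a camera-ray 3D positional encoding, trained with a DDPM-style noise-prediction loss masked to the pixels where ground-truth depth exists. The module names, the Fourier ray encoding, the cosine noise schedule, and all hyperparameters are assumptions for illustration only.

```python
# Minimal sketch of pixel-level diffusion for depth with sparse supervision.
# NOT the authors' implementation: module names, the camera-ray Fourier
# encoding, and the cosine schedule are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def ray_encoding(K_inv: torch.Tensor, H: int, W: int, n_freqs: int = 8) -> torch.Tensor:
    """Fourier-encode unit viewing rays so each pixel carries 3D geometry."""
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=0)  # (3, H, W)
    rays = (K_inv @ pix.reshape(3, -1)).reshape(3, H, W)
    rays = rays / rays.norm(dim=0, keepdim=True)                      # unit directions
    freqs = (2.0 ** torch.arange(n_freqs, dtype=torch.float32)).view(-1, 1, 1, 1)
    ang = torch.pi * freqs * rays.unsqueeze(0)                        # (n_freqs, 3, H, W)
    return torch.cat([ang.sin(), ang.cos()], dim=0).reshape(1, -1, H, W)


class PixelDenoiser(nn.Module):
    """1x1 convolutions only, so the denoiser acts independently per pixel."""

    def __init__(self, feat_dim: int, enc_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + feat_dim + enc_dim + 1, hidden, 1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 1), nn.SiLU(),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, noisy_depth, feats, enc, t_frac):
        B, _, H, W = noisy_depth.shape
        t = t_frac.view(-1, 1, 1, 1).expand(B, 1, H, W)               # timestep channel
        x = torch.cat([noisy_depth, feats, enc.expand(B, -1, -1, -1), t], dim=1)
        return self.net(x)


def sparse_diffusion_loss(model, depth, valid, feats, enc, T: int = 1000):
    """DDPM noise-prediction loss evaluated only at pixels with depth labels."""
    B = depth.shape[0]
    t = torch.randint(0, T, (B,), device=depth.device)
    abar = torch.cos(0.5 * torch.pi * t.float() / T).pow(2).view(-1, 1, 1, 1)
    eps = torch.randn_like(depth)
    noisy = abar.sqrt() * depth + (1.0 - abar).sqrt() * eps
    pred = model(noisy, feats, enc, t.float() / T)
    err = F.mse_loss(pred, eps, reduction="none")
    valid = valid.float()                                             # (B,1,H,W) mask
    return (err * valid).sum() / valid.sum().clamp(min=1.0)
```

Because the denoiser here uses only 1x1 convolutions, the training loss decomposes over pixels, which is what lets sparse, unstructured labels (e.g., projected LiDAR returns) supervise training directly.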
Related papers
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities [88.398085358514]
Contrastive Deepfake Embeddings (CoDE) is a novel embedding space specifically designed for deepfake detection.
CoDE is trained via contrastive learning by additionally enforcing global-local similarities.
arXiv Detail & Related papers (2024-07-29T18:00:10Z)
- GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions [22.077366472693395]
We introduce a new generative approach for synthesizing 3D geometry and images from single-view collections.
Existing approaches that employ volumetric rendering with neural radiance fields inherit a key limitation: the generated geometry is noisy and unconstrained.
We propose GeoGen, a new SDF-based 3D generative model trained in an end-to-end manner.
arXiv Detail & Related papers (2024-06-06T17:00:10Z)
- GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z)
- Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation [20.230238670888454]
We introduce Marigold, a method for affine-invariant monocular depth estimation (see the alignment sketch after this list).
It can be fine-tuned in a couple of days on a single GPU using only synthetic training data.
It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases.
arXiv Detail & Related papers (2023-12-04T18:59:13Z)
- Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering [93.94371335579321]
We propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations.
Comprehensive experiments underscore our framework's superior generalization capabilities.
Our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients.
arXiv Detail & Related papers (2023-09-18T12:36:39Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- GraphCSPN: Geometry-Aware Depth Completion via Dynamic GCNs [49.55919802779889]
We propose a Graph Convolution based Spatial Propagation Network (GraphCSPN) as a general approach for depth completion.
In this work, we leverage convolutional neural networks as well as graph neural networks in a complementary way for geometric representation learning.
Our method achieves state-of-the-art performance, especially when only a few propagation steps are used.
arXiv Detail & Related papers (2022-10-19T17:56:03Z)
- Learning to Recover 3D Scene Shape from a Single Image [98.20106822614392]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then use 3D point cloud encoders to predict the missing depth shift and focal length that allow us to recover a realistic 3D scene shape.
arXiv Detail & Related papers (2020-12-17T02:35:13Z)
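Several entries above (Marigold, the geometry-preserving framework, and the two-stage scene-shape method) revolve around depth that is correct only up to an unknown scale and shift, in contrast to GRIN's metric predictions. As a minimal sketch of what "affine-invariant" means in practice, the closed-form least-squares alignment below recovers the scale s and shift t that best map a prediction onto metric ground truth before computing error; the function name and masking convention are illustrative, not taken from any of the papers above.

```python
# Minimal sketch of least-squares scale-and-shift alignment, as commonly used
# to evaluate affine-invariant depth. Illustrative only.
import torch


def align_scale_shift(pred: torch.Tensor, gt: torch.Tensor, valid: torch.Tensor):
    """Solve min_{s,t} sum_i valid_i * (s * pred_i + t - gt_i)^2 in closed form."""
    p, g, m = pred.flatten(), gt.flatten(), valid.flatten().float()
    n = m.sum().clamp(min=1.0)
    p_mean = (m * p).sum() / n
    g_mean = (m * g).sum() / n
    var = (m * (p - p_mean) ** 2).sum()
    cov = (m * (p - p_mean) * (g - g_mean)).sum()
    s = cov / var.clamp(min=1e-8)   # scale
    t = g_mean - s * p_mean         # shift
    return s * pred + t, s, t


# Example usage (hypothetical tensors):
# aligned, s, t = align_scale_shift(pred_depth, gt_depth, gt_depth > 0)
# abs_rel = ((aligned - gt_depth).abs() / gt_depth)[gt_depth > 0].mean()
```

A zero-shot metric method like GRIN is evaluated without this alignment step, which is precisely what makes the metric setting harder than the affine-invariant one.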
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it provides (including this list) and is not responsible for any consequences arising from its use.