BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
- URL: http://arxiv.org/abs/2407.17952v1
- Date: Thu, 25 Jul 2024 11:16:37 GMT
- Title: BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
- Authors: Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, Christopher Schroers
- Abstract summary: BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning.
By efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance.
BetterDepth can improve the performance of other MDE models in a plug-and-play manner without additional re-training.
- Score: 25.047835960649167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficiently precise details. Although recent diffusion-based MDE approaches exhibit appealing detail extraction ability, they still struggle in geometrically challenging scenes due to the difficulty of gaining robust geometric priors from diverse datasets. To leverage the complementary merits of both worlds, we propose BetterDepth to efficiently achieve geometrically correct affine-invariant MDE performance while capturing fine-grained details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning, in which the global depth context is well-captured, and iteratively refines details based on the input image. For the training of such a refiner, we propose global pre-alignment and local patch masking methods to ensure the faithfulness of BetterDepth to depth conditioning while learning to capture fine-grained scene details. By efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without additional re-training.
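The abstract names two training-time techniques: global pre-alignment of the conditioning depth and local patch masking. The sketch below is one plausible reading of those ideas, not the authors' implementation; the least-squares affine fit, the patch size, and the agreement threshold are all illustrative assumptions.

```python
# Illustrative sketch (not the authors' code) of the two training-time
# techniques described in the abstract, on toy depth maps with NumPy.
import numpy as np

def global_pre_align(pred, gt):
    """Fit scale s and shift t so that s*pred + t best matches gt in the
    least-squares sense (affine-invariant depth alignment)."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, gt.ravel(), rcond=None)[0]
    return s * pred + t

def local_patch_mask(depth_cond, gt, patch=4, thresh=0.05):
    """Keep patches where the pre-aligned conditioning depth agrees with
    ground truth and mask out patches where it disagrees beyond `thresh`,
    so the refiner can stay faithful to the conditioning elsewhere.
    Patch size and threshold are hypothetical choices."""
    h, w = gt.shape
    mask = np.ones_like(gt, dtype=bool)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            err = np.abs(depth_cond[i:i + patch, j:j + patch]
                         - gt[i:i + patch, j:j + patch]).mean()
            if err > thresh:
                mask[i:i + patch, j:j + patch] = False
    return mask

# Toy example: a prediction that differs from ground truth only by an
# affine transform is recovered exactly by the pre-alignment step.
rng = np.random.default_rng(0)
gt = rng.random((8, 8))
pred = 2.0 * gt + 0.5          # affine-distorted prediction
aligned = global_pre_align(pred, gt)
mask = local_patch_mask(aligned, gt)
```

Under this reading, the diffusion refiner would be supervised only where the mask is true, anchoring it to the globally correct conditioning while the input image drives the fine detail.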
Related papers
- Adaptive Discrete Disparity Volume for Self-supervised Monocular Depth Estimation [0.0]
In this paper, we propose a learnable module, Adaptive Discrete Disparity Volume (ADDV)
ADDV is capable of dynamically sensing depth distributions in different RGB images and generating adaptive bins for them.
We also introduce novel training strategies - uniformizing and sharpening - to provide regularizations under self-supervised conditions.
arXiv Detail & Related papers (2024-04-04T04:22:25Z) - UniDepth: Universal Monocular Metric Depth Estimation [81.80512457953903]
We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains.
Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations.
Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth.
arXiv Detail & Related papers (2024-03-27T18:06:31Z) - PCDepth: Pattern-based Complementary Learning for Monocular Depth Estimation by Best of Both Worlds [15.823230141827358]
Event cameras record scene dynamics with high temporal resolution, providing rich scene details for monocular depth estimation.
Existing complementary learning approaches for MDE fuse intensity information from images and scene details from event data for better scene understanding.
We propose a Pattern-based Complementary learning architecture for monocular Depth estimation (PCDepth)
arXiv Detail & Related papers (2024-02-29T07:31:59Z) - M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy [26.227927019615446]
Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in substantial training and storage costs.
Dataset condensation has been developed to learn a small synthetic set that preserves essential information from the original large-scale dataset.
We present a novel DM-based method named M3D for dataset condensation by Minimizing the Maximum Mean Discrepancy.
arXiv Detail & Related papers (2023-12-26T07:45:32Z) - Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model [34.85279074665031]
Methods for monocular depth estimation have made significant strides on standard benchmarks, but zero-shot metric depth estimation remains unsolved.
Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes.
We advocate a generic, task-agnostic diffusion model, with several advancements such as log-scale depth parameterization.
arXiv Detail & Related papers (2023-12-20T18:27:47Z) - Hierarchical Integration Diffusion Model for Realistic Image Deblurring [71.76410266003917]
Diffusion models (DMs) have been introduced in image deblurring and exhibited promising performance.
We propose the Hierarchical Integration Diffusion Model (HI-Diff) for realistic image deblurring.
Experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T12:18:20Z) - Single Image Depth Prediction Made Better: A Multivariate Gaussian Take [163.14849753700682]
We introduce an approach that performs continuous modeling of per-pixel depth.
Our method (named MG) ranks among the top on the KITTI depth-prediction benchmark leaderboard.
arXiv Detail & Related papers (2023-03-31T16:01:03Z) - Fully Self-Supervised Depth Estimation from Defocus Clue [79.63579768496159]
We propose a self-supervised framework that estimates depth purely from a sparse focal stack.
We show that our framework circumvents the need for depth and all-in-focus (AIF) image ground truth, and achieves superior predictions.
arXiv Detail & Related papers (2023-03-19T19:59:48Z) - Dense Depth Distillation with Out-of-Distribution Simulated Images [30.79756881887895]
We study data-free knowledge distillation (KD) for monocular depth estimation (MDE)
KD learns a lightweight model for real-world depth perception by compressing a trained teacher model without access to training data in the target domain.
We show that our method outperforms baseline KD by a clear margin and achieves slightly better performance even with as few as 1/6 of the training images.
arXiv Detail & Related papers (2022-08-26T07:10:01Z) - An Adaptive Framework for Learning Unsupervised Depth Completion [59.17364202590475]
We present a method to infer a dense depth map from a color image and associated sparse depth measurements.
We show that regularization and co-visibility are related via the fitness of the model to data and can be unified into a single framework.
arXiv Detail & Related papers (2021-06-06T02:27:55Z) - Learnable Bernoulli Dropout for Bayesian Deep Learning [53.79615543862426]
Learnable Bernoulli dropout (LBD) is a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters.
LBD leads to improved accuracy and uncertainty estimates in image classification and semantic segmentation.
arXiv Detail & Related papers (2020-02-12T18:57:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.