PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage
- URL: http://arxiv.org/abs/2409.09144v1
- Date: Fri, 13 Sep 2024 19:03:48 GMT
- Title: PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage
- Authors: Denis Zavadski, Damjan Kalšan, Carsten Rother
- Abstract summary: This work addresses the task of zero-shot monocular depth estimation.
A recent advance in this field has been the idea of utilising Text-to-Image foundation models, such as Stable Diffusion.
We present PrimeDepth, a method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches.
- Score: 19.02295657801464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work addresses the task of zero-shot monocular depth estimation. A recent advance in this field has been the idea of utilising Text-to-Image foundation models, such as Stable Diffusion. Foundation models provide a rich and generic image representation, and therefore, little training data is required to reformulate them as a depth estimation model that predicts highly-detailed depth maps and has good generalisation capabilities. However, the realisation of this idea has so far led to approaches which are, unfortunately, highly inefficient at test-time due to the underlying iterative denoising process. In this work, we propose a different realisation of this idea and present PrimeDepth, a method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches. Our key idea is to extract from Stable Diffusion a rich, but frozen, image representation by running a single denoising step. This representation, which we term the preimage, is then fed into a refiner network with an architectural inductive bias, before entering the downstream task. We validate experimentally that PrimeDepth is two orders of magnitude faster than the leading diffusion-based method, Marigold, while being more robust for challenging scenarios and quantitatively marginally superior. Thereby, we reduce the gap to the currently leading data-driven approach, Depth Anything, which is still quantitatively superior, but predicts less detailed depth maps and requires 20 times more labelled data. Due to the complementary nature of our approach, even a simple averaging between PrimeDepth and Depth Anything predictions can improve upon both methods and set a new state-of-the-art in zero-shot monocular depth estimation. In future, data-driven approaches may also benefit from integrating our preimage.
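The abstract mentions that a simple average of PrimeDepth and Depth Anything predictions improves on both. The sketch below illustrates what such an average could look like. It is not the authors' procedure: the abstract does not say how the two predictions are aligned, so the per-image min-max normalisation is an assumption made here, and the function names `normalize_depth` and `average_predictions` are hypothetical.

```python
import numpy as np


def normalize_depth(depth, eps=1e-8):
    """Rescale a relative depth map to [0, 1] using its own min and max.

    Assumption: relative depth predictions from different models may differ
    in scale and shift, so each map is normalised before averaging.
    """
    depth = np.asarray(depth, dtype=np.float64)
    lo, hi = depth.min(), depth.max()
    return (depth - lo) / max(hi - lo, eps)


def average_predictions(pred_a, pred_b):
    """Average two depth predictions of the same shape, pixel by pixel."""
    a = normalize_depth(pred_a)
    b = normalize_depth(pred_b)
    if a.shape != b.shape:
        raise ValueError("predictions must have the same shape")
    return 0.5 * (a + b)


# Toy 2x2 "depth maps" standing in for the outputs of two models.
pred_a = np.array([[1.0, 2.0], [3.0, 5.0]])
pred_b = np.array([[10.0, 30.0], [30.0, 50.0]])
fused = average_predictions(pred_a, pred_b)
print(fused)
```

In practice an alignment step other than min-max scaling (for example a least-squares scale-and-shift fit) might be preferable; the choice here only keeps the example short.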
Related papers
- bit2bit: 1-bit quanta video reconstruction via self-supervised photon prediction [57.199618102578576]
We propose bit2bit, a new method for reconstructing high-quality image stacks at the original spatiotemporal resolution from sparse binary quanta image data.
Inspired by recent work on Poisson denoising, we developed an algorithm that creates a dense image sequence from sparse binary photon data.
We present a novel dataset containing a wide range of real SPAD high-speed videos under various challenging imaging conditions.
arXiv Detail & Related papers (2024-10-30T17:30:35Z) - Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think [53.2706196341054]
We show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed.
We perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models.
arXiv Detail & Related papers (2024-09-17T16:58:52Z) - DepthFM: Fast Monocular Depth Estimation with Flow Matching [22.206355073676082]
Current discriminative approaches to this problem are limited due to blurry artifacts.
State-of-the-art generative methods suffer from slow sampling due to their SDE nature.
We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality.
arXiv Detail & Related papers (2024-03-20T17:51:53Z) - Exploiting Diffusion Prior for Generalizable Dense Prediction [85.4563592053464]
Recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate.
We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks.
Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
arXiv Detail & Related papers (2023-11-30T18:59:44Z) - Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring [48.80983873199214]
We develop a data-driven approach to model the saturated pixels by a learned latent map.
Based on the new model, the non-blind deblurring task can be formulated as a maximum a posteriori (MAP) problem.
To estimate high-quality deblurred images without amplified artifacts, we develop a prior estimation network.
arXiv Detail & Related papers (2023-08-10T12:53:30Z) - DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation [23.22005119986485]
DiffusionDepth is a new approach that reformulates monocular depth estimation as a denoising diffusion process.
It learns an iterative denoising process to 'denoise' a random depth distribution into a depth map with the guidance of monocular visual conditions.
Experimental results on KITTI and NYU-Depth-V2 datasets suggest that a simple yet efficient diffusion approach could reach state-of-the-art performance in both indoor and outdoor scenarios with acceptable inference time.
arXiv Detail & Related papers (2023-03-09T03:48:24Z) - Uncertainty-Aware Unsupervised Image Deblurring with Deep Residual Prior [23.417096880297702]
Non-blind deblurring methods achieve decent performance under the accurate blur kernel assumption.
Hand-crafted prior, incorporating domain knowledge, generally performs well but may lead to poor performance when kernel (or induced) error is complex.
Data-driven prior, which excessively depends on the diversity and abundance of training data, is vulnerable to out-of-distribution blurs and images.
We propose an unsupervised semi-blind deblurring model which recovers the latent image from the blurry image and inaccurate blur kernel.
arXiv Detail & Related papers (2022-10-09T11:10:59Z) - Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras produce brightness changes in the form of a stream of asynchronous events instead of intensity frames.
Recent learning-based approaches have been applied to event-based data, such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z) - AcED: Accurate and Edge-consistent Monocular Depth Estimation [0.0]
Single image depth estimation is a challenging problem.
We formulate a fully differentiable ordinal regression and train the network in end-to-end fashion.
A novel per-pixel confidence map computation for depth refinement is also proposed.
arXiv Detail & Related papers (2020-06-16T15:21:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.