DepthFM: Fast Monocular Depth Estimation with Flow Matching
- URL: http://arxiv.org/abs/2403.13788v1
- Date: Wed, 20 Mar 2024 17:51:53 GMT
- Title: DepthFM: Fast Monocular Depth Estimation with Flow Matching
- Authors: Ming Gui, Johannes S. Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer
- Abstract summary: Current discriminative approaches to monocular depth estimation are limited by blurry artifacts.
State-of-the-art generative methods suffer from slow sampling due to their SDE nature.
We observe that a direct mapping from input image to depth map can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality.
- Score: 22.206355073676082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular depth estimation is crucial for numerous downstream vision tasks and applications. Current discriminative approaches to this problem are limited due to blurry artifacts, while state-of-the-art generative methods suffer from slow sampling due to their SDE nature. Rather than starting from noise, we seek a direct mapping from input image to depth map. We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality. Our study demonstrates that a pre-trained image diffusion model can serve as an adequate prior for a flow matching depth model, allowing efficient training on only synthetic data to generalize to real images. We find that an auxiliary surface normals loss further improves the depth estimates. Due to the generative nature of our approach, our model reliably predicts the confidence of its depth estimates. On standard benchmarks of complex natural scenes, our lightweight approach exhibits state-of-the-art performance at favorable low computational cost despite only being trained on little synthetic data.
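The abstract's point about straight trajectories can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the random arrays standing in for image and depth representations, and the oracle velocity field are all assumptions made for illustration. It shows the generic flow matching idea that, along a straight path from a source x0 to a target x1, the velocity is the constant x1 - x0, so a fixed-step Euler solver reaches the target even with a single step.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity field along a straight path."""
    return x1 - x0

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Hypothetical stand-ins: random arrays instead of real image/depth data.
rng = np.random.default_rng(0)
image_latent = rng.normal(size=(4, 4))
depth_latent = rng.normal(size=(4, 4))

# Oracle velocity field (assumption): a trained network would only approximate it.
def oracle_velocity(x, t):
    return target_velocity(image_latent, depth_latent)

one_step = euler_sample(image_latent, oracle_velocity, num_steps=1)
ten_steps = euler_sample(image_latent, oracle_velocity, num_steps=10)
```

With the oracle field, both `one_step` and `ten_steps` land on `depth_latent` up to floating-point error. A learned velocity field is only approximately constant along the path, so in practice the number of steps trades speed against accuracy.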
Related papers
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- Towards Unsupervised Blind Face Restoration using Diffusion Prior [12.69610609088771]
Blind face restoration methods have shown remarkable performance when trained on large-scale synthetic datasets with supervised learning.
These datasets are often generated by simulating low-quality face images with a handcrafted image degradation pipeline.
In this paper, we address this issue by using only a set of input images, with unknown degradations and without ground truth targets, to fine-tune a restoration model.
Our best model also achieves state-of-the-art results on both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-10-06T20:38:14Z)
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think [53.2706196341054]
We show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed.
We perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models.
arXiv Detail & Related papers (2024-09-17T16:58:52Z)
- PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage [19.02295657801464]
This work addresses the task of zero-shot monocular depth estimation.
A recent advance in this field has been the idea of utilising Text-to-Image foundation models, such as Stable Diffusion.
We present PrimeDepth, a method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches.
arXiv Detail & Related papers (2024-09-13T19:03:48Z)
- Consistency Regularisation for Unsupervised Domain Adaptation in Monocular Depth Estimation [15.285720572043678]
We formulate unsupervised domain adaptation for monocular depth estimation as a consistency-based semi-supervised learning problem.
We introduce a pairwise loss function that regularises predictions on the source domain while enforcing consistency across multiple augmented views.
In our experiments, we rely on the standard depth estimation benchmarks KITTI and NYUv2 to demonstrate state-of-the-art results.
arXiv Detail & Related papers (2024-05-27T23:32:06Z)
- FlowDepth: Decoupling Optical Flow for Self-Supervised Monocular Depth Estimation [8.78717459496649]
We propose FlowDepth, where a Dynamic Motion Flow Module (DMFM) decouples the optical flow by a mechanism-based approach and warps the dynamic regions, thus solving the mismatch problem.
To address the unfairness of photometric errors caused by high-frequency and low-texture regions, we use Depth-Cue-Aware Blur (DCABlur) at the input level and a Cost-Volume sparsity loss at the loss level.
arXiv Detail & Related papers (2024-03-28T10:31:23Z)
- The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation [42.48819460873482]
Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity.
We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions.
arXiv Detail & Related papers (2023-06-02T21:26:20Z)
- Low-Light Image Enhancement with Wavelet-based Diffusion Models [50.632343822790006]
Diffusion models have achieved promising results in image restoration tasks, yet they are time-consuming, consume excessive computational resources, and produce unstable restorations.
We propose a robust and efficient Diffusion-based Low-Light image enhancement approach, dubbed DiffLL.
arXiv Detail & Related papers (2023-06-01T03:08:28Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
However, it relies on the multi-view consistency assumption for training networks, which is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Occlusion-aware Unsupervised Learning of Depth from 4-D Light Fields [50.435129905215284]
We present an unsupervised learning-based depth estimation method for 4-D light field processing and analysis.
Based on the basic knowledge of the unique geometry structure of light field data, we explore the angular coherence among subsets of the light field views to estimate depth maps.
Our method can significantly shrink the performance gap between the previous unsupervised method and supervised ones, and produce depth maps with comparable accuracy to traditional methods with obviously reduced computational cost.
arXiv Detail & Related papers (2021-06-06T06:19:50Z)
- A Plug-and-play Scheme to Adapt Image Saliency Deep Model for Video Data [54.198279280967185]
This paper proposes a novel plug-and-play scheme to weakly retrain a pretrained image saliency deep model for video data.
Our method is simple yet effective for adapting any off-the-shelf pre-trained image saliency deep model to obtain high-quality video saliency detection.
arXiv Detail & Related papers (2020-08-02T13:23:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.