Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model
- URL: http://arxiv.org/abs/2509.15220v1
- Date: Thu, 18 Sep 2025 17:59:19 GMT
- Title: Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model
- Authors: Fangjinhua Wang, Qingshan Xu, Yew-Soon Ong, Marc Pollefeys
- Abstract summary: We propose a novel MVS framework, which introduces diffusion models in MVS. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS.
- Score: 81.01939699480094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To reconstruct the 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse depth maps into a mesh or point cloud. To improve the computational efficiency, many methods initialize a coarse depth map and then gradually refine it in higher resolutions. Recently, diffusion models achieve great success in generation tasks. Starting from a random noise, diffusion models gradually recover the sample with an iterative denoising process. In this paper, we propose a novel MVS framework, which introduces diffusion models in MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining lightweight 2D U-Net and convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy to adaptively sample depth hypotheses based on the confidence estimated by diffusion model. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D. Code is available at: https://github.com/cvg/diffmvs.
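The abstract's confidence-based sampling strategy can be pictured as shrinking the per-pixel depth search interval as confidence grows. The following is a minimal illustrative sketch, not the paper's exact formulation: the linear radius schedule, hypothesis count, and evenly spaced offsets are all assumptions made here for clarity.

```python
import numpy as np

def sample_depth_hypotheses(depth, confidence, num_hypotheses=4, base_radius=0.1):
    """For each pixel, sample `num_hypotheses` depth candidates around the
    current estimate; higher confidence shrinks the search interval.

    depth:      (H, W) current depth map
    confidence: (H, W) values in [0, 1], where 1 means fully confident
    returns:    (num_hypotheses, H, W) depth candidates
    """
    # Search radius shrinks linearly as confidence grows (assumed schedule).
    radius = base_radius * (1.0 - confidence)            # (H, W)
    # Evenly spaced offsets in [-1, 1], broadcast over the image grid.
    offsets = np.linspace(-1.0, 1.0, num_hypotheses)     # (K,)
    hypotheses = depth[None] + offsets[:, None, None] * radius[None]
    # Keep all sampled depths strictly positive.
    return np.clip(hypotheses, a_min=1e-3, a_max=None)

depth = np.full((2, 2), 1.0)
conf = np.array([[1.0, 0.5], [0.0, 0.25]])
hyps = sample_depth_hypotheses(depth, conf)
print(hyps.shape)  # (4, 2, 2)
```

At a fully confident pixel the interval collapses and all hypotheses equal the current estimate, while an uncertain pixel gets candidates spread across the full base radius; the actual method estimates confidence with the diffusion model itself.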
Related papers
- DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion [65.5172878666262]
Diffusion inversion is a challenging task due to the lack of viable supervision signals. We propose a novel self-supervised diffusion inversion approach, termed Deep Inversion (DeepInv). DeepInv is also equipped with an iterative and multi-scale training regime to train a parameterized inversion solver.
arXiv Detail & Related papers (2026-01-04T11:27:26Z)
- PSI3D: Plug-and-Play 3D Stochastic Inference with Slice-wise Latent Diffusion Prior [5.104613802755622]
We introduce a plug-and-play algorithm for 3D inference with a latent diffusion prior (PSI3D). Specifically, we formulate a Markov chain Monte Carlo approach to reconstruct each two-dimensional (2D) slice by sampling from a 2D latent diffusion model.
arXiv Detail & Related papers (2025-12-20T13:37:22Z)
- PointDico: Contrastive 3D Representation Learning Guided by Diffusion Models [5.077352707415241]
PointDico learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation. PointDico achieves a new state-of-the-art in 3D representation learning, e.g., 94.32% accuracy on ScanObjectNN and 86.5% Inst. mIoU on ShapeNetPart.
arXiv Detail & Related papers (2025-12-09T07:57:56Z)
- FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation [7.731788894265875]
We present FVGen, a framework that enables fast novel view synthesis using Video Diffusion Models (VDMs) in as few as four sampling steps. Our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%.
arXiv Detail & Related papers (2025-08-08T15:22:41Z)
- TADA: Improved Diffusion Sampling with Training-free Augmented Dynamics [42.99251753481681]
We introduce a new sampling method that is up to 186% faster than the current state-of-the-art solver at comparable FID on ImageNet512. The key to our method resides in using higher-dimensional initial noise, which allows it to produce more detailed samples.
arXiv Detail & Related papers (2025-06-26T20:30:27Z)
- One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full- and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views [47.215089338101066]
We present Sparse3D, a novel 3D reconstruction method tailored for sparse view inputs.
Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field.
By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results.
arXiv Detail & Related papers (2023-08-27T11:52:00Z)
- One at a Time: Progressive Multi-step Volumetric Probability Learning for Reliable 3D Scene Perception [59.37727312705997]
This paper proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps.
Considering the recent advances achieved by strong generative diffusion models, we introduce a multi-step learning framework, dubbed VPD.
For the semantic scene completion (SSC) task, our work stands out as the first to surpass LiDAR-based methods on the Semantic KITTI dataset.
arXiv Detail & Related papers (2023-06-22T05:55:53Z)
- The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation [42.48819460873482]
Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity.
We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions.
arXiv Detail & Related papers (2023-06-02T21:26:20Z)
- IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo [71.84742490020611]
IterMVS is a new data-driven method for high-resolution multi-view stereo.
We propose a novel GRU-based estimator that encodes pixel-wise probability distributions of depth in its hidden state.
We verify the efficiency and effectiveness of our method on DTU, Tanks & Temples and ETH3D.
arXiv Detail & Related papers (2021-12-09T18:58:02Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.