Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
- URL: http://arxiv.org/abs/2505.14414v1
- Date: Tue, 20 May 2025 14:27:45 GMT
- Title: Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
- Authors: Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, Yunde Jia
- Abstract summary: Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve the generalization in ill-posed regions. We propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format. We also formulate the final direct fusion of monocular depth to the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them.
- Score: 27.15757281613792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The matching formulation makes it inherently hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but a biased monocular prior learned from small stereo datasets constrains generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Second, when the monocular feature is used in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of the monocular depth map could alleviate the local-optima problem, but noisy disparity results computed in the first several iterations would misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving both the local-optima and noise problems. In addition, we formulate the final direct fusion of monocular depth into the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching effectively and efficiently. Experiments show that our method significantly improves performance when generalizing from SceneFlow to the Middlebury and Booster datasets, while barely reducing efficiency.
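The two core ideas in the abstract can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: `binary_local_ordering` and `affine_align` are hypothetical names, the paper's ordering map is computed over local windows inside a learned network, and its registration module is pixel-wise rather than the single global least-squares fit shown here.

```python
import numpy as np

def binary_local_ordering(depth, dy=0, dx=1):
    """Binary relative format: 1 where a pixel is deeper than its
    neighbor at offset (dy, dx), else 0. The ordering is invariant to
    the affine (scale/shift) ambiguity of relative monocular depth,
    so it can compare a relative depth map with absolute disparity."""
    shifted = np.roll(depth, shift=(dy, dx), axis=(0, 1))
    return (depth > shifted).astype(np.float32)

def affine_align(mono, disp, valid):
    """Registration of monocular depth to disparity as a global
    least-squares fit disp ~ a * mono + b over valid pixels,
    resolving the relative-vs-absolute misalignment."""
    m, d = mono[valid].ravel(), disp[valid].ravel()
    A = np.stack([m, np.ones_like(m)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    return a * mono + b
```

In practice the monocular map would come from a VFM-based relative depth estimator, and `valid` would mask out the noisy disparities produced in the first iterations.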
Related papers
- PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion [61.6340987158734]
We present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. We show that PFDepth sets a state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks.
arXiv Detail & Related papers (2025-09-30T09:38:59Z) - OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment [31.118114556998048]
We introduce OmniDepth, a unified framework that bridges monocular and stereo approaches to 3D estimation. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry.
arXiv Detail & Related papers (2025-08-06T16:31:22Z) - Integrating Disparity Confidence Estimation into Relative Depth Prior-Guided Unsupervised Stereo Matching [55.784713740698365]
Unsupervised stereo matching has garnered significant attention for its independence from costly disparity annotations. A feasible solution lies in transferring 3D geometric knowledge from a relative depth map to the stereo matching networks. This work proposes a novel unsupervised learning framework to address these challenges.
arXiv Detail & Related papers (2025-08-02T09:11:05Z) - MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network [15.138039805633353]
We propose MonoMVSNet, a novel monocular feature and depth guided MVS network. MonoMVSNet integrates powerful priors from a monocular foundation model into multi-view geometry. Experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets.
arXiv Detail & Related papers (2025-07-15T14:05:22Z) - Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model [62.37493746544967]
Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps. Existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation.
arXiv Detail & Related papers (2025-03-30T16:24:22Z) - MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors [52.39201779505421]
MonSter++ is a foundation model for multi-view depth estimation. It integrates monocular depth priors into multi-view depth estimation. MonSter++ achieves new state-of-the-art on both stereo matching and multi-view stereo.
arXiv Detail & Related papers (2025-01-15T08:11:24Z) - Relative Pose Estimation through Affine Corrections of Monocular Depth Priors [69.59216331861437]
We develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities. We propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints.
arXiv Detail & Related papers (2025-01-09T18:58:30Z) - V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints [6.7197802356130465]
We introduce a learning-based depth map fusion framework that accepts a set of depth and confidence maps generated by a Multi-View Stereo (MVS) algorithm as input and improves them.
We also introduce a depth search window estimation sub-network trained jointly with the larger fusion sub-network to reduce the depth hypothesis search space along each ray.
Our method learns to model depth consensus and violations of visibility constraints directly from the data.
arXiv Detail & Related papers (2023-08-17T00:39:56Z) - Multi-resolution Monocular Depth Map Fusion by Self-supervised Gradient-based Composition [14.246972408737987]
We propose a novel depth map fusion module to combine the advantages of estimations with multi-resolution inputs.
Our lightweight depth fusion is one-shot and runs in real-time, making our method 80X faster than a state-of-the-art depth fusion method.
arXiv Detail & Related papers (2022-12-03T05:13:50Z) - On Robust Cross-View Consistency in Self-Supervised Monocular Depth Estimation [56.97699793236174]
We study two kinds of robust cross-view consistency in this paper.
We exploit the temporal coherence in both depth feature space and 3D voxel space for self-supervised monocular depth estimation.
Experimental results on several outdoor benchmarks show that our method outperforms current state-of-the-art techniques.
arXiv Detail & Related papers (2022-09-19T03:46:13Z) - Orthogonal Matrix Retrieval with Spatial Consensus for 3D Unknown-View Tomography [58.60249163402822]
Unknown-view tomography (UVT) reconstructs a 3D density map from its 2D projections at unknown, random orientations.
The proposed OMR is more robust and performs significantly better than the previous state-of-the-art OMR approach.
arXiv Detail & Related papers (2022-07-06T21:40:59Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - Fusion of Range and Stereo Data for High-Resolution Scene-Modeling [20.824550995195057]
This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps.
We combine low-resolution depth data with high-resolution stereo data, in a maximum a posteriori (MAP) formulation.
The accuracy of the method is not compromised, owing to three properties of the data-term in the energy function.
arXiv Detail & Related papers (2020-12-12T09:37:42Z) - Ladybird: Quasi-Monte Carlo Sampling for Deep Implicit Field Based 3D Reconstruction with Symmetry [12.511526058118143]
We propose a sampling scheme that theoretically encourages generalization and results in fast convergence for SGD-based optimization algorithms.
Based on the reflective symmetry of an object, we propose a feature fusion method that alleviates issues due to self-occlusions.
Our proposed system Ladybird is able to create high quality 3D object reconstructions from a single input image.
arXiv Detail & Related papers (2020-07-27T09:17:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.