MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors
- URL: http://arxiv.org/abs/2501.08643v2
- Date: Thu, 25 Sep 2025 06:51:56 GMT
- Title: MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors
- Authors: Junda Cheng, Wenjing Liao, Zhipeng Cai, Longliang Liu, Gangwei Xu, Xianqi Wang, Yuzhou Wang, Zikang Yuan, Yong Deng, Jinliang Zang, Yangyang Shi, Jinhui Tang, Xin Yang
- Abstract summary: MonSter++ is a foundation model for multi-view depth estimation. It integrates monocular depth priors into multi-view depth estimation. MonSter++ achieves new state-of-the-art on both stereo matching and multi-view stereo.
- Score: 52.39201779505421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MonSter++, a geometric foundation model for multi-view depth estimation, unifying rectified stereo matching and unrectified multi-view stereo. Both tasks fundamentally recover metric depth from correspondence search and consequently face the same dilemma: struggling to handle ill-posed regions with limited matching cues. To address this, we propose MonSter++, a novel method that integrates monocular depth priors into multi-view depth estimation, effectively combining the complementary strengths of single-view and multi-view cues. MonSter++ fuses monocular depth and multi-view depth into a dual-branched architecture. Confidence-based guidance adaptively selects reliable multi-view cues to correct scale ambiguity in monocular depth. The refined monocular predictions, in turn, effectively guide multi-view estimation in ill-posed regions. This iterative mutual enhancement enables MonSter++ to evolve coarse object-level monocular priors into fine-grained, pixel-level geometry, fully unlocking the potential of multi-view depth estimation. MonSter++ achieves new state-of-the-art on both stereo matching and multi-view stereo. By effectively incorporating monocular priors through our cascaded search and multi-scale depth fusion strategy, our real-time variant RT-MonSter++ also outperforms previous real-time methods by a large margin. As shown in Fig.1, MonSter++ achieves significant improvements over previous methods across eight benchmarks from three tasks -- stereo matching, real-time stereo matching, and multi-view stereo, demonstrating the strong generality of our framework. Besides high accuracy, MonSter++ also demonstrates superior zero-shot generalization capability. We will release both the large and the real-time models to facilitate their use by the open-source community.
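The mutual-enhancement loop described in the abstract, confidence-gated scale correction of the monocular branch followed by monocular-guided refinement of the multi-view branch, can be sketched compactly. The sketch below is illustrative only: the fixed confidence threshold, the closed-form scale/shift fit, and the two black-box update operators (`mono_update`, `mv_update`) are assumptions, not the authors' released code.

```python
import torch

def scale_align(mono_depth, mv_depth, conf, thresh=0.5):
    """Correct the scale/shift ambiguity of monocular depth using only
    multi-view pixels whose confidence exceeds a threshold (assumed 0.5)."""
    mask = conf > thresh
    m = mono_depth[mask]
    d = mv_depth[mask]
    # Closed-form least squares for mv_depth ~= s * mono_depth + t.
    A = torch.stack([m, torch.ones_like(m)], dim=1)
    sol = torch.linalg.lstsq(A, d.unsqueeze(1)).solution
    s, t = sol[0], sol[1]
    return s * mono_depth + t

def mutual_refine(mono_depth, mv_depth, conf, mono_update, mv_update, iters=4):
    """Iterative mutual enhancement: each branch refines the other."""
    for _ in range(iters):
        # Reliable multi-view cues fix the monocular scale ambiguity.
        mono_depth = scale_align(mono_depth, mv_depth, conf)
        # The now-metric monocular prior guides the multi-view branch in
        # ill-posed, texture-poor regions.
        mv_depth = mv_update(mv_depth, mono_depth)
        # The monocular branch is refined with trusted multi-view geometry.
        mono_depth = mono_update(mono_depth, mv_depth, conf)
    return mv_depth
```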
Related papers
- PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts [25.236900618180652]
We propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information. Experiments demonstrate that PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets.
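A rough illustration of a prompt-conditioned recurrent refinement unit in the spirit of the PRU summary above; the GRU-style gating, feature dimensions, and disparity-delta head are all assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn

class PromptRecurrentUnit(nn.Module):
    """Hypothetical GRU-style unit: monocular-structure and stereo-motion
    prompts update a latent state that emits a disparity correction."""
    def __init__(self, dim=128):
        super().__init__()
        self.gru = nn.GRUCell(2 * dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, hidden, mono_prompt, motion_prompt):
        # Concatenate the two prompt vectors per token: (B, 2 * dim).
        x = torch.cat([mono_prompt, motion_prompt], dim=-1)
        hidden = self.gru(x, hidden)
        return hidden, self.head(hidden)  # updated state, disparity delta
```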
arXiv Detail & Related papers (2026-03-02T09:30:32Z) - OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment [31.118114556998048]
We introduce OmniDepth, a unified framework that bridges monocular and stereo approaches to 3D estimation. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry.
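A cross-attentive alignment step of this kind might look like the minimal sketch below, where stereo hypothesis tokens query monocular context tokens; the token layout and residual/norm placement are assumptions:

```python
import torch.nn as nn

class CrossAttentiveAlignment(nn.Module):
    """Stereo features attend to monocular features to import structure
    priors; a symmetric block could refine the monocular side in turn."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, stereo_feat, mono_feat):
        # stereo_feat, mono_feat: (B, N, C) flattened spatial tokens.
        aligned, _ = self.attn(query=stereo_feat, key=mono_feat, value=mono_feat)
        return self.norm(stereo_feat + aligned)
```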
arXiv Detail & Related papers (2025-08-06T16:31:22Z) - MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network [15.138039805633353]
We propose MonoMVSNet, a novel monocular feature and depth guided MVS network. MonoMVSNet integrates powerful priors from a monocular foundation model into multi-view geometry. Experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets.
arXiv Detail & Related papers (2025-07-15T14:05:22Z) - Diving into the Fusion of Monocular Priors for Generalized Stereo Matching [27.15757281613792]
Recently, stereo matching has progressed by leveraging unbiased monocular priors from vision foundation models (VFMs) to improve generalization in ill-posed regions. We propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format. We also formulate the final direct fusion of monocular depth into disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them.
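The two ingredients named above can be sketched directly; the single-neighbor ordering comparison and the global (rather than truly pixel-wise) regression below are simplifications made for illustration:

```python
import torch

def local_ordering_map(depth, shift=1):
    """Binary relative format: is each pixel nearer than its neighbor
    `shift` pixels away? Invariant to the unknown monocular scale/shift."""
    rolled = torch.roll(depth, shifts=shift, dims=-1)
    return (depth > rolled).float()

def register_mono_to_disparity(mono_inv_depth, disparity, weights):
    """Weighted least squares fit of disparity ~= a * mono + b. A per-pixel
    variant would predict (a, b) maps instead of two global scalars."""
    w = weights.flatten()
    m = mono_inv_depth.flatten()
    d = disparity.flatten()
    A = torch.stack([m, torch.ones_like(m)], dim=1) * w.unsqueeze(1)
    sol = torch.linalg.lstsq(A, (d * w).unsqueeze(1)).solution
    a, b = sol[0], sol[1]
    return a * mono_inv_depth + b
```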
arXiv Detail & Related papers (2025-05-20T14:27:45Z) - Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model [62.37493746544967]
Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps.
Existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments.
We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation.
arXiv Detail & Related papers (2025-03-30T16:24:22Z) - Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion [88.67015254278859]
We introduce the Mono2Stereo dataset, providing high-quality training data and a benchmark to support in-depth exploration of stereo conversion.
We conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider all pixels equally, failing to concentrate on regions critical to the stereo effect.
We introduce a new evaluation metric, Stereo Intersection-over-Union, which harmonizes disparity and achieves a high correlation with human judgments on stereo effect.
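The paper's exact formulation of Stereo Intersection-over-Union is not reproduced here, but one plausible soft variant, treating per-pixel disparity magnitudes as set membership so that agreement in high-disparity (strong stereo effect) regions dominates the score, would be:

```python
import numpy as np

def soft_stereo_iou(disp_pred, disp_gt, valid=None):
    """Illustrative soft IoU over disparity magnitudes: sum of per-pixel
    minima over sum of per-pixel maxima. Not the paper's exact metric."""
    if valid is None:
        valid = np.isfinite(disp_gt)
    p, g = disp_pred[valid], disp_gt[valid]
    inter = np.minimum(p, g).sum()
    union = np.maximum(p, g).sum()
    return float(inter / max(union, 1e-8))
```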
arXiv Detail & Related papers (2025-03-28T09:25:58Z) - MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction [45.70946415376022]
Monocular depth priors have been widely adopted by neural rendering in multi-view tasks such as 3D reconstruction and novel view synthesis. Current methods treat the entire estimated depth map indiscriminately and use it as ground-truth supervision. We propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors.
arXiv Detail & Related papers (2025-03-24T05:58:06Z) - Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation [83.841877607646]
We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation. The dataset includes accurate depth and disparity labels obtained by projecting 3D point clouds onto equirectangular images. We benchmark leading stereo depth estimation models for both standard and omnidirectional images.
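Projecting 3D points onto an equirectangular image to rasterize depth labels reduces to a longitude/latitude mapping; the axis convention below (x right, y down, z forward) is an assumption and may not match the dataset's:

```python
import numpy as np

def project_equirectangular(points, width, height):
    """points: (N, 3) in the camera frame. Returns pixel coords (u, v)
    and per-point range, suitable for z-buffered depth rasterization."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    lon = np.arctan2(x, z)                        # azimuth in [-pi, pi]
    lat = np.arcsin(np.clip(y / r, -1.0, 1.0))    # elevation in [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * width).astype(int) % width
    v = np.clip(((lat / np.pi + 0.5) * height).astype(int), 0, height - 1)
    return u, v, r
```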
arXiv Detail & Related papers (2024-11-27T13:34:41Z) - GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes [47.76269541664071]
This paper tackles the challenges of self-supervised monocular depth estimation in indoor scenes caused by large rotation between frames and low texture.
We obtain coarse camera poses from monocular sequences through multi-view geometry to deal with the former.
To soften the effect of low texture, we combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism.
arXiv Detail & Related papers (2023-09-26T17:59:57Z) - Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes, without needing heuristically crafted masks.
Experiments on real-world datasets demonstrate the effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z) - Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion [45.171150395915056]
3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations.
Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations.
We resort to stereo matching techniques and bird's-eye-view (BEV) representation learning to address these issues in SSC.
arXiv Detail & Related papers (2023-03-24T12:33:44Z) - 2T-UNET: A Two-Tower UNet with Depth Clues for Robust Stereo Depth Estimation [0.2578242050187029]
This paper revisits the depth estimation problem, avoiding the explicit stereo matching step using a simple two-tower convolutional neural network.
The proposed algorithm is called 2T-UNet.
The architecture performs remarkably well on complex natural scenes, highlighting its usefulness for various real-time applications.
arXiv Detail & Related papers (2022-10-27T12:34:41Z) - DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors [60.88824519770208]
Camera-based 3D object detectors are attractive due to their wider deployability and lower price than LiDAR sensors.
We revisit the prior stereo model DSGN and its stereo volume construction for representing both 3D geometry and semantics.
We propose DSGN++, which aims to improve information flow throughout the 2D-to-3D pipeline.
arXiv Detail & Related papers (2022-04-06T18:43:54Z) - Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO).
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z) - H-Net: Unsupervised Attention-based Stereo Depth Estimation Leveraging Epipolar Geometry [4.968452390132676]
We introduce the H-Net, a deep-learning framework for unsupervised stereo depth estimation.
For the first time, a Siamese autoencoder architecture is used for depth estimation.
Our method outperforms state-of-the-art unsupervised stereo depth estimation methods.
arXiv Detail & Related papers (2021-04-22T19:16:35Z) - SMD-Nets: Stereo Mixture Density Networks [68.56947049719936]
We propose Stereo Mixture Density Networks (SMD-Nets), a simple yet effective learning framework compatible with a wide class of 2D and 3D architectures.
Specifically, we exploit bimodal mixture densities as output representation and show that this allows for sharp and precise disparity estimates near discontinuities.
We carry out comprehensive experiments on a new high-resolution and highly realistic synthetic stereo dataset, consisting of stereo pairs at 8Mpx resolution, as well as on real-world stereo datasets.
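A bimodal output head of this kind can be written compactly. The two-mode Laplacian parameterization below matches the general idea of a bimodal density, but the channel layout and activations are guesses rather than the paper's code:

```python
import torch
import torch.nn.functional as F

def bimodal_nll(params, disp_gt, eps=1e-6):
    """params: (B, 5, H, W) -> mixing weight pi plus (mu, b) per Laplacian
    mode. Returns the mean negative log-likelihood of the ground truth."""
    pi = torch.sigmoid(params[:, 0])
    mu1, mu2 = params[:, 1], params[:, 2]
    b1 = F.softplus(params[:, 3]) + eps
    b2 = F.softplus(params[:, 4]) + eps
    log_p1 = -torch.abs(disp_gt - mu1) / b1 - torch.log(2 * b1)
    log_p2 = -torch.abs(disp_gt - mu2) / b2 - torch.log(2 * b2)
    # log(pi * p1 + (1 - pi) * p2) via logsumexp for numerical stability.
    stacked = torch.stack([torch.log(pi + eps) + log_p1,
                           torch.log1p(-pi + eps) + log_p2], dim=0)
    return -torch.logsumexp(stacked, dim=0).mean()

def disparity_from_mixture(params):
    """Point estimate = mean of the dominant mode, so predictions stay
    sharp at discontinuities instead of blurring across the two modes."""
    pi = torch.sigmoid(params[:, 0])
    return torch.where(pi >= 0.5, params[:, 1], params[:, 2])
```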
arXiv Detail & Related papers (2021-04-08T16:15:46Z) - Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation [51.714092199995044]
In many fields, self-supervised learning solutions are rapidly evolving and closing the gap with supervised approaches.
We propose a novel self-supervised paradigm reversing the link between the two.
In order to train deep stereo networks, we distill knowledge through a monocular completion network.
arXiv Detail & Related papers (2020-08-17T07:40:22Z) - Increased-Range Unsupervised Monocular Depth Estimation [8.105699831214608]
In this work, we propose to integrate the advantages of the small and wide baselines.
By training the network using three horizontally aligned views, we obtain accurate depth predictions for both close and far ranges.
Our strategy allows inferring multi-baseline depth from a single image.
arXiv Detail & Related papers (2020-06-23T07:01:32Z)