Related papers: OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

URL: http://arxiv.org/abs/2508.04611v1
Date: Wed, 06 Aug 2025 16:31:22 GMT
Title: OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment
Authors: Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu
Abstract summary: We introduce OmniDepth, a unified framework that bridges monocular and stereo approaches to 3D estimation. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry.
Score: 31.118114556998048
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by >40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth enables robust 3D perception that transcends modality-specific limitations. Codes available at https://github.com/aeolusguan/OmniDepth.

Related papers

Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model [62.37493746544967]
Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps. Existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation.
arXiv Detail & Related papers (2025-03-30T16:24:22Z)
MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction [45.70946415376022]
Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision. We propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors.
arXiv Detail & Related papers (2025-03-24T05:58:06Z)
Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail [37.90622613373521]
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). We show that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions.
arXiv Detail & Related papers (2024-12-05T18:59:58Z)
Single-View View Synthesis with Self-Rectified Pseudo-Stereo [49.946151180828465]
We leverage the reliable and explicit stereo prior to generate a pseudo-stereo viewpoint. We propose a self-rectified stereo synthesis to amend erroneous regions in an identify-rectify manner. Our method outperforms state-of-the-art single-view view synthesis methods and stereo synthesis methods.
arXiv Detail & Related papers (2023-04-19T09:36:13Z)
Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion [45.171150395915056]
3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. We resort to stereo matching technique and bird's-eye-view (BEV) representation learning to address such issues in SSC.
arXiv Detail & Related papers (2023-03-24T12:33:44Z)
Self-Supervised Depth Estimation in Laparoscopic Image using 3D Geometric Consistency [7.902636435901286]
We present M3Depth, a self-supervised depth estimator to leverage 3D geometric structural information hidden in stereo pairs. Our method outperforms previous self-supervised approaches on both a public dataset and a newly acquired dataset by a large margin.
arXiv Detail & Related papers (2022-08-17T17:03:48Z)
PanoDepth: A Two-Stage Approach for Monocular Omnidirectional Depth Estimation [11.66493799838823]
We propose a novel, model-agnostic, two-stage pipeline for omnidirectional monocular depth estimation. Our framework PanoDepth takes one 360 image as input, produces one or more synthesized views in the first stage, and feeds the original image and the synthesized images into the subsequent stereo matching stage. Our results show that PanoDepth outperforms the state-of-the-art approaches by a large margin for 360 monocular depth estimation.
arXiv Detail & Related papers (2022-02-02T23:08:06Z)
SMD-Nets: Stereo Mixture Density Networks [68.56947049719936]
We propose Stereo Mixture Density Networks (SMD-Nets), a simple yet effective learning framework compatible with a wide class of 2D and 3D architectures. Specifically, we exploit bimodal mixture densities as output representation and show that this allows for sharp and precise disparity estimates near discontinuities. We carry out comprehensive experiments on a new high-resolution and highly realistic synthetic stereo dataset, consisting of stereo pairs at 8Mpx resolution, as well as on real-world stereo datasets.
arXiv Detail & Related papers (2021-04-08T16:15:46Z)
Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation [51.714092199995044]
In many fields, self-supervised learning solutions are rapidly evolving and filling the gap with supervised approaches. We propose a novel self-supervised paradigm reversing the link between the two. In order to train deep stereo networks, we distill knowledge through a monocular completion network.
arXiv Detail & Related papers (2020-08-17T07:40:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.