Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
- URL: http://arxiv.org/abs/2511.02247v1
- Date: Tue, 04 Nov 2025 04:25:15 GMT
- Title: Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
- Authors: Hao Li, Daiwei Lu, Jesse d'Almeida, Dilara Isik, Ehsan Khodapanah Aghdam, Nick DiSanto, Ayberk Acar, Susheela Sharma, Jie Ying Wu, Robert J. Webster III, Ipek Oguz
- Abstract summary: Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames. We present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics.
- Score: 5.257607423828341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However, a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.
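The two alignment objectives described in the abstract can be sketched as loss terms computed on encoder features. The following is a minimal numpy illustration, not the authors' implementation: the feature vectors, the logistic domain discriminator, and all function names are hypothetical stand-ins for the adversarial and directional-consistency components.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def domain_adversarial_loss(f_syn, f_real, w, b):
    """Binary cross-entropy of a toy logistic domain discriminator.

    In adversarial feature alignment, the encoder is trained against this
    objective (e.g. via a gradient reversal layer) so that features from
    translated-synthetic and real frames become indistinguishable.
    """
    p_syn = sigmoid(f_syn @ w + b)    # discriminator target for synthetic: 0
    p_real = sigmoid(f_real @ w + b)  # discriminator target for real: 1
    eps = 1e-8
    return -(np.log(1.0 - p_syn + eps).mean() + np.log(p_real + eps).mean())

def directional_consistency_loss(f_src, f_tgt):
    """1 - cosine similarity between paired latent features.

    Encourages real-frame features to point in the same direction as the
    depth-supervised translated-synthetic features.
    """
    num = (f_src * f_tgt).sum(axis=1)
    den = np.linalg.norm(f_src, axis=1) * np.linalg.norm(f_tgt, axis=1) + 1e-8
    return float((1.0 - num / den).mean())

rng = np.random.default_rng(0)
f_syn, f_real = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
w, b = rng.normal(size=16), 0.0
adv = domain_adversarial_loss(f_syn, f_real, w, b)
cons = directional_consistency_loss(f_syn, f_real)
```

In a real training loop the discriminator and encoder would be updated in opposition; here the point is only the shape of the two objectives.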
Related papers
- EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation via Diffusion Depth Completion [15.100363020538852]
We propose EndoDDC, an endoscopy depth completion method that integrates images and sparse depth information with depth gradient features. Our approach outperforms state-of-the-art models in both depth accuracy and robustness.
arXiv Detail & Related papers (2026-02-25T13:21:49Z)
- StarryGazer: Leveraging Monocular Depth Estimation Models for Domain-Agnostic Single Depth Image Completion [56.28564075246147]
StarryGazer is a framework that predicts dense depth images from a single sparse depth image and an RGB image. We employ a pre-trained MDE model to produce relative depth images. A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness.
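Frameworks that combine a pre-trained relative-depth MDE model with sparse measurements typically rest on a scale-and-shift alignment step. A minimal numpy sketch of that standard step follows; it is illustrative of the general technique, not StarryGazer's actual pipeline, and all names are hypothetical.

```python
import numpy as np

def align_relative_depth(rel_depth, sparse_depth, valid):
    """Fit metric = s * relative + t by least squares over sparse points.

    rel_depth:    (H, W) relative depth from a pre-trained MDE model
    sparse_depth: (H, W) metric depth, defined only where `valid` is True
    valid:        (H, W) boolean mask of sparse measurements
    """
    r = rel_depth[valid]
    m = sparse_depth[valid]
    A = np.stack([r, np.ones_like(r)], axis=1)  # design matrix [r, 1]
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s * rel_depth + t

# Toy example: relative depth differs from metric depth by scale 2, shift 1.
H, W = 8, 8
true = np.linspace(1.0, 3.0, H * W).reshape(H, W)  # ground-truth metric depth
rel = (true - 1.0) / 2.0                           # relative depth, unknown scale/shift
mask = np.zeros((H, W), dtype=bool)
mask[::3, ::3] = True                              # sparse measurement locations
dense = align_relative_depth(rel, np.where(mask, true, 0.0), mask)
```

A learned refinement network, as in the paper, would then correct the residual structure that a global affine fit cannot capture.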
arXiv Detail & Related papers (2025-12-15T09:56:09Z)
- EndoMUST: Monocular Depth Estimation for Robotic Endoscopy via End-to-end Multi-step Self-supervised Training [0.7499722271664147]
A novel framework with multi-step efficient fine-tuning is proposed in this work. Based on parameter-efficient fine-tuning of the foundation model, the proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-06-19T04:31:59Z)
- MetaFE-DE: Learning Meta Feature Embedding for Depth Estimation from Monocular Endoscopic Images [18.023231290573268]
Existing methods primarily estimate depth directly from RGB images. We introduce a novel concept referred to as meta feature embedding (MetaFE). In this paper, we propose a two-stage self-supervised learning paradigm for monocular endoscopic depth estimation.
arXiv Detail & Related papers (2025-02-05T02:52:30Z)
- Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion [57.08169927189237]
Existing methods for depth completion operate in tightly constrained settings. Inspired by advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation. Marigold-DC builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance.
arXiv Detail & Related papers (2024-12-18T00:06:41Z)
- Structure-preserving Image Translation for Depth Estimation in Colonoscopy Video [1.0485739694839669]
We propose a pipeline of structure-preserving synthetic-to-real (sim2real) image translation.
This allows us to generate large quantities of realistic-looking synthetic images for supervised depth estimation.
We also propose a dataset of hand-picked sequences from clinical colonoscopies to improve the image translation process.
arXiv Detail & Related papers (2024-08-19T17:02:16Z)
- Towards Domain-agnostic Depth Completion [28.25756709062647]
Existing depth completion methods are often targeted at a specific sparse depth type and generalize poorly across task domains.
We present a method to complete sparse/semi-dense, noisy, and potentially low-resolution depth maps obtained by various range sensors.
Our method shows superior cross-domain generalization ability against state-of-the-art depth completion methods.
arXiv Detail & Related papers (2022-07-29T04:10:22Z)
- SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning [53.78813049373321]
We propose a self-supervised learning method for pre-trained supervised monocular depth networks to enable metrically scaled depth estimation.
Our approach is useful for various applications such as mobile robot navigation and is applicable to diverse environments.
arXiv Detail & Related papers (2022-03-10T12:28:42Z)
- Learning Depth via Leveraging Semantics: Self-supervised Monocular Depth Estimation with Both Implicit and Explicit Semantic Guidance [34.62415122883441]
We propose a Semantic-aware Spatial Feature Alignment scheme to align implicit semantic features with depth features for scene-aware depth estimation.
We also propose a semantic-guided ranking loss to explicitly constrain the estimated depth maps to be consistent with real scene contextual properties.
Our method produces high-quality depth maps that are consistently superior on both complex scenes and diverse semantic categories.
arXiv Detail & Related papers (2021-02-11T14:29:51Z)
- SelfDeco: Self-Supervised Monocular Depth Completion in Challenging Indoor Environments [50.761917113239996]
We present a novel algorithm for self-supervised monocular depth completion.
Our approach is based on training a neural network that requires only sparse depth measurements and corresponding monocular video sequences without dense depth labels.
Our self-supervised algorithm is designed for challenging indoor environments with textureless regions, glossy, transparent, and non-Lambertian surfaces, moving people, long and diverse depth ranges, and scenes captured by complex ego-motions.
arXiv Detail & Related papers (2020-11-10T08:55:07Z)
- Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera [93.04135520894631]
This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene.
A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents.
To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete.
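The complementary nature of the two depth sources suggests a simple fusion: estimate the scale of the complete-but-scale-ambiguous DSV against the metrically consistent DMV over their overlap, then fill DMV's holes. A minimal numpy sketch of that idea follows; it is illustrative of the scale-alignment principle, not the paper's full formulation.

```python
import numpy as np

def fuse_dsv_dmv(dsv, dmv, valid):
    """Rescale DSV to DMV over their overlap, then fill DMV's holes.

    dsv:   (H, W) depth from single view -- complete, scale ambiguous
    dmv:   (H, W) depth from multi-view stereo -- metric, incomplete
    valid: (H, W) boolean mask of pixels where dmv is defined
    """
    # Least-squares scale s minimizing || dmv - s * dsv ||^2 on the overlap.
    s = (dmv[valid] * dsv[valid]).sum() / (dsv[valid] ** 2).sum()
    fused = np.where(valid, dmv, s * dsv)
    return fused, s

# Toy example: DSV is the true depth at half the true scale; DMV covers
# only the left half of the image.
H, W = 6, 6
depth = np.linspace(2.0, 5.0, H * W).reshape(H, W)  # underlying scene depth
dsv = 0.5 * depth                                   # complete, wrong scale
valid = np.zeros((H, W), dtype=bool)
valid[:, :3] = True
dmv = np.where(valid, depth, 0.0)
fused, s = fuse_dsv_dmv(dsv, dmv, valid)
```

In the paper's dynamic-scene setting this alignment would be done per view, since DSV's scale can vary across viewpoints.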
arXiv Detail & Related papers (2020-04-02T22:45:53Z)
- Don't Forget The Past: Recurrent Depth Estimation from Monocular Video [92.84498980104424]
We put three different types of depth estimation into a common framework.
Our method produces a time series of depth maps.
It can be applied to monocular videos only or be combined with different types of sparse depth patterns.
arXiv Detail & Related papers (2020-01-08T16:50:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.