Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision
- URL: http://arxiv.org/abs/2602.17785v1
- Date: Thu, 19 Feb 2026 19:38:11 GMT
- Title: Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision
- Authors: Xinwei Ju, Rema Daher, Danail Stoyanov, Sophia Bano, Francisco Vasconcelos
- Abstract summary: Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation. We propose **PRISM**, a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning.
- Score: 11.141482696146275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose **PRISM** (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.
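The edge-guidance idea described above can be illustrated with an edge-aware smoothness term: depth gradients are penalised everywhere except where the edge map (e.g. from DexiNed or HED) signals a likely anatomical boundary. The function below is a minimal NumPy sketch of that idea only; the weighting constant `alpha` and the exact form of the term are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def edge_guided_smoothness(depth: np.ndarray, edges: np.ndarray,
                           alpha: float = 10.0) -> float:
    """Edge-aware smoothness penalty: depth gradients are penalised,
    except where the edge map responds strongly (likely true boundary)."""
    # Horizontal / vertical first-order depth gradients.
    dx = np.abs(depth[:, 1:] - depth[:, :-1])
    dy = np.abs(depth[1:, :] - depth[:-1, :])
    # Down-weight the penalty where the edge response is strong.
    wx = np.exp(-alpha * edges[:, 1:])
    wy = np.exp(-alpha * edges[1:, :])
    return float((dx * wx).mean() + (dy * wy).mean())
```

A flat depth map incurs zero penalty, and a depth discontinuity that coincides with a detected edge is penalised much less than the same discontinuity in an edge-free region, which is the behaviour an edge prior is meant to encourage.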
Related papers
- UM-Depth: Uncertainty Masked Self-Supervised Monocular Depth Estimation with Visual Odometry [3.8323580808203785]
We introduce UM-Depth, a framework that combines motion- and uncertainty-aware refinement to enhance depth accuracy.
We develop a teacher training strategy that embeds uncertainty estimation into both the training pipeline and the network architecture.
UM-Depth achieves state-of-the-art results in both self-supervised depth and pose estimation on the KITTI dataset.
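A common way to fold a predicted per-pixel uncertainty into a self-supervised photometric loss is the heteroscedastic weighting of Kendall & Gal; the exact masking used in UM-Depth may differ, so the sketch below only illustrates the general mechanism. `residual` stands for a per-pixel photometric error and `log_var` for the network's predicted log-variance, both assumed names.

```python
import numpy as np

def uncertainty_weighted_loss(residual: np.ndarray, log_var: np.ndarray) -> float:
    """Heteroscedastic weighting: pixels with high predicted variance
    contribute less to the loss, while the log-variance regulariser
    discourages the network from inflating variance everywhere."""
    return float(np.mean(np.exp(-log_var) * residual + log_var))
```

With zero predicted variance this reduces to the plain mean residual; raising the variance on an unreliable pixel (e.g. one violating brightness constancy) lowers its contribution.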
arXiv Detail & Related papers (2025-09-17T05:51:07Z)
- Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather [48.65180004211851]
We present a robust monocular depth estimation method called **ACDepth** from the perspective of high-quality training data generation and domain adaptation.
Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training.
We design a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and the pretrained Depth Anything V2.
arXiv Detail & Related papers (2025-05-18T02:30:47Z)
- Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images [1.1084686909647639]
We propose a self-supervised monocular depth estimation network tailored for endoscopic scenes.
Existing methods, though accurate, typically assume consistent illumination; in endoscopy, however, illumination varies across frames.
These illumination variations lead to incorrect geometric interpretations and unreliable self-supervised signals.
arXiv Detail & Related papers (2025-04-24T14:12:57Z)
- Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding [1.0445560141983634]
We propose a novel image-based semantic embedding that extracts contextual information directly from visual features.
Our method achieves performance comparable to state-of-the-art models while addressing the shortcomings of CLIP embeddings in handling outdoor scenes.
arXiv Detail & Related papers (2025-02-01T15:37:22Z)
- Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era [81.15890262168449]
Shadows are created when light encounters obstacles, resulting in regions of reduced illumination.
This paper offers a benchmark on shadow detection, removal, and generation in both images and videos.
It focuses on the deep learning approaches of the past decade.
arXiv Detail & Related papers (2024-09-03T17:59:05Z)
- Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering [93.94371335579321]
We propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations.
Comprehensive experiments underscore our framework's superior generalization capabilities.
Our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients.
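"Scale-and-shift coefficients" here refers to the affine ambiguity of relative depth predictions. While the paper above recovers them through its loss functions, the standard closed-form alternative is a least-squares alignment between predicted and reference depth, in the style popularised by MiDaS-type evaluation; the snippet below is only an illustration of what these coefficients mean.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, ref: np.ndarray) -> tuple:
    """Closed-form least-squares (s, t) minimising ||s * pred + t - ref||^2."""
    # Design matrix with the prediction and a constant column for the shift.
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return float(s), float(t)
```

If the reference is an exact affine transform of the prediction, the recovered pair reproduces that transform; on noisy data it gives the best affine fit in the least-squares sense.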
arXiv Detail & Related papers (2023-09-18T12:36:39Z)
- Learning to Simulate Realistic LiDARs [66.7519667383175]
We introduce a pipeline for data-driven simulation of a realistic LiDAR sensor.
We show that our model can learn to encode realistic effects such as dropped points on transparent surfaces.
We use our technique to learn models of two distinct LiDAR sensors and use them to improve simulated LiDAR data accordingly.
arXiv Detail & Related papers (2022-09-22T13:12:54Z)
- Learnable Patchmatch and Self-Teaching for Multi-Frame Depth Estimation in Monocular Endoscopy [16.233423010425355]
We propose a novel unsupervised multi-frame monocular depth estimation model.
The proposed model integrates a learnable patchmatch module to adaptively increase discriminative ability in regions with low and homogeneous texture.
As a byproduct of the self-teaching paradigm, the proposed model is able to improve its depth predictions when more frames are input at test time.
arXiv Detail & Related papers (2022-05-30T12:11:03Z)
- Occlusion-aware Unsupervised Learning of Depth from 4-D Light Fields [50.435129905215284]
We present an unsupervised learning-based depth estimation method for 4-D light field processing and analysis.
Exploiting the unique geometric structure of light field data, we leverage the angular coherence among subsets of the light field views to estimate depth maps.
Our method significantly narrows the performance gap between previous unsupervised methods and supervised ones, producing depth maps of accuracy comparable to traditional methods at markedly lower computational cost.
arXiv Detail & Related papers (2021-06-06T06:19:50Z)
- Unsupervised Scale-consistent Depth Learning from Video [131.3074342883371]
We propose a monocular depth estimator SC-Depth, which requires only unlabelled videos for training.
Thanks to the capability of scale-consistent prediction, we show that our monocular-trained deep networks are readily integrated into the ORB-SLAM2 system.
The proposed hybrid Pseudo-RGBD SLAM shows compelling results on KITTI and generalizes well to the KAIST dataset without additional training.
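Scale-consistent prediction of the kind SC-Depth targets rests on a geometry consistency term that compares the depth of one frame with the depth of an adjacent frame warped into its view; normalising the difference by the depth sum keeps the term scale-invariant. The sketch below captures only that term, assuming the pose-and-intrinsics warping has already been applied to produce `depth_b_warped`.

```python
import numpy as np

def geometry_consistency(depth_a: np.ndarray, depth_b_warped: np.ndarray) -> float:
    """Normalised absolute depth difference in [0, 1); identical depths give 0,
    and scaling both inputs by a common factor leaves the value unchanged."""
    diff = np.abs(depth_a - depth_b_warped) / (depth_a + depth_b_warped)
    return float(diff.mean())
```

Because the term is invariant to a global rescaling of both depth maps, it constrains consecutive predictions to share a scale without fixing what that scale is, which is what makes the monocular network's output usable by a SLAM back-end such as ORB-SLAM2.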
arXiv Detail & Related papers (2021-05-25T02:17:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.