EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images
- URL: http://arxiv.org/abs/2508.17916v1
- Date: Mon, 25 Aug 2025 11:33:05 GMT
- Title: EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images
- Authors: Xinning Yao, Bo Liu, Bojian Li, Jingjing Wang, Jinghua Yue, Fugen Zhou
- Abstract summary: EndoUFM is an unsupervised monocular depth estimation framework. It enhances depth estimation performance by leveraging powerful pre-learned priors. This work contributes to augmenting surgeons' spatial perception during minimally invasive procedures.
- Score: 7.350425834778092
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Depth estimation is a foundational component of 3D reconstruction in minimally invasive endoscopic surgery. However, existing monocular depth estimation techniques often perform poorly under the varying illumination and complex textures of the surgical environment. While powerful visual foundation models offer a promising solution, their training on natural images leads to significant domain-adaptability limitations and semantic-perception deficiencies when applied to endoscopy. In this study, we introduce EndoUFM, an unsupervised monocular depth estimation framework that integrates dual foundation models for surgical scenes, enhancing depth estimation performance by leveraging their powerful pre-learned priors. The framework features a novel adaptive fine-tuning strategy that incorporates Random Vector Low-Rank Adaptation (RVLoRA) to enhance model adaptability, and a Residual block based on Depthwise Separable Convolution (Res-DSC) to improve the capture of fine-grained local features. Furthermore, we design a mask-guided smoothness loss to enforce depth consistency within anatomical tissue structures. Extensive experiments on the SCARED, Hamlyn, SERV-CT, and EndoNeRF datasets confirm that our method achieves state-of-the-art performance while maintaining an efficient model size. This work contributes to augmenting surgeons' spatial perception during minimally invasive procedures, thereby enhancing surgical precision and safety, with crucial implications for augmented reality and navigation systems.
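The abstract does not give the exact form of EndoUFM's mask-guided smoothness loss. The sketch below shows the standard edge-aware smoothness term common in self-supervised depth estimation (Monodepth-style), gated by a binary tissue mask so that depth gradients are penalized only within anatomical structures; all names and the specific formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mask_guided_smoothness(depth, image, mask):
    """Edge-aware smoothness penalty applied only inside a tissue mask.

    depth: (H, W) predicted depth; image: (H, W) grayscale intensities;
    mask:  (H, W) binary tissue mask in {0, 1}.
    """
    # First-order depth gradients along x and y
    d_dx = np.abs(depth[:, 1:] - depth[:, :-1])
    d_dy = np.abs(depth[1:, :] - depth[:-1, :])
    # Image gradients down-weight the penalty at intensity edges
    w_x = np.exp(-np.abs(image[:, 1:] - image[:, :-1]))
    w_y = np.exp(-np.abs(image[1:, :] - image[:-1, :]))
    # A gradient is counted only when both neighboring pixels are masked in
    m_x = mask[:, 1:] * mask[:, :-1]
    m_y = mask[1:, :] * mask[:-1, :]
    num = (d_dx * w_x * m_x).sum() + (d_dy * w_y * m_y).sum()
    den = m_x.sum() + m_y.sum() + 1e-8
    return num / den
```

A constant depth map yields zero loss, while a depth ramp inside the mask is penalized unless it coincides with an image edge.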
Related papers
- Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching [7.674595072442547]
This research presents a novel deep learning pipeline for establishing feature correspondences in endoscopic image pairs. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences. Our pipeline surpasses state-of-the-art methodologies on the SCARED dataset, with improved matching precision and lower epipolar error.
arXiv Detail & Related papers (2025-12-11T07:44:00Z) - EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model [2.8913847481700667]
3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surgery. Intrinsic calibration is hindered by sterility constraints and the use of specialized endoscopes with continuous zoom and telescope rotation. In this paper, we integrate intrinsic parameter estimation into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 (DA2) model for joint depth, pose, and intrinsics prediction. Our method is validated on the SCARED and C3VD public datasets, demonstrating superior performance compared to recent state-of-the-art methods.
arXiv Detail & Related papers (2025-10-25T16:39:04Z) - Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction [3.251946340142663]
A unified framework for monocular endoscopic tissue reconstruction is presented. It integrates scale-aware depth prediction with temporally-constrained perceptual refinement. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework's robustness and superiority over state-of-the-art methods.
arXiv Detail & Related papers (2025-08-15T07:41:17Z) - Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy [2.906891207990726]
We introduce a novel fine-tuning strategy for the Depth Anything Model. We integrate it with an intrinsic-based unsupervised monocular depth estimation framework. Our results show that our method achieves state-of-the-art performance while minimizing the number of trainable parameters.
arXiv Detail & Related papers (2024-09-12T03:04:43Z) - ToDER: Towards Colonoscopy Depth Estimation and Reconstruction with Geometry Constraint Adaptation [67.22294293695255]
We propose a novel reconstruction pipeline with a bi-directional adaptation architecture named ToDER to get precise depth estimations.
Experimental results demonstrate that our approach can precisely predict depth maps in both realistic and synthetic colonoscopy videos.
arXiv Detail & Related papers (2024-07-23T14:24:26Z) - FLex: Joint Pose and Dynamic Radiance Fields Optimization for Stereo Endoscopic Videos [79.50191812646125]
Reconstruction of endoscopic scenes is an important asset for various medical applications, from post-surgery analysis to educational training.
We address the challenging setup of a moving endoscope within a highly dynamic environment of deforming tissue.
We propose an implicit scene separation into multiple overlapping 4D neural radiance fields (NeRFs) and a progressive optimization scheme jointly optimizing for reconstruction and camera poses from scratch.
This improves ease-of-use and allows reconstruction capabilities to scale in time to process surgical videos of 5,000 frames and more; an improvement of more than ten times over the state of the art, while being agnostic to external tracking information.
arXiv Detail & Related papers (2024-03-18T19:13:02Z) - Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery [12.92291406687467]
We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of the DINOv2 for depth estimation in endoscopic surgery.
We build LoRA layers and integrate them into DINO to adapt with surgery-specific domain knowledge instead of conventional fine-tuning.
Our model is extensively validated on a MICCAI challenge dataset of SCARED, which is collected from da Vinci Xi endoscope surgery.
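The low-rank adaptation that Surgical-DINO builds on (and that EndoUFM's RVLoRA extends) can be sketched in a few lines. This is the generic LoRA forward pass, not either paper's exact implementation; all shapes and names are illustrative.

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, alpha=1.0):
    """Generic LoRA forward pass: the frozen weight W is augmented with a
    trainable low-rank update A @ B.

    y = x @ W + alpha * (x @ A) @ B, where only A (d_in x r) and
    B (r x d_out) are trained; W_frozen (d_in x d_out) stays fixed.
    """
    return x @ W_frozen + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2
W = rng.standard_normal((d_in, d_out))   # frozen pretrained weight
A = rng.standard_normal((d_in, r))       # trainable down-projection
B = np.zeros((r, d_out))                 # LoRA convention: B starts at zero,
                                         # so the adapter is a no-op at init
x = rng.standard_normal((4, d_in))
y = lora_forward(x, W, A, B)
```

With r much smaller than d_in and d_out, the trainable parameter count drops from d_in*d_out to r*(d_in + d_out), which is why LoRA-style adapters keep fine-tuned models small.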
arXiv Detail & Related papers (2024-01-11T16:22:42Z) - Neural LerPlane Representations for Fast 4D Reconstruction of Deformable Tissues [52.886545681833596]
LerPlane is a novel method for fast and accurate reconstruction of surgical scenes under a single-viewpoint setting.
LerPlane treats surgical procedures as 4D volumes and factorizes them into explicit 2D planes of static and dynamic fields.
LerPlane shares static fields, significantly reducing the workload of dynamic tissue modeling.
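LerPlane's factorization of a 4D volume into explicit 2D planes can be illustrated with a toy lookup. The real method uses learned multi-channel feature planes, bilinear interpolation, and a decoder network; this is a structural sketch only, and every name in it is hypothetical.

```python
import numpy as np

def lerplane_feature(planes, x, y, z, t):
    """Toy sketch of plane factorization: a 4D field (x, y, z, t) is
    represented by three static planes (xy, xz, yz) and three dynamic
    planes (xt, yt, zt); a point's value is the product of one
    nearest-neighbor lookup per plane.  Coordinates are in [0, 1].
    """
    keys = [("xy", x, y), ("xz", x, z), ("yz", y, z),   # static fields
            ("xt", x, t), ("yt", y, t), ("zt", z, t)]   # dynamic fields
    feat = 1.0
    for name, u, v in keys:
        P = planes[name]
        iu = int(round(u * (P.shape[0] - 1)))  # nearest-neighbor lookup
        iv = int(round(v * (P.shape[1] - 1)))
        feat *= P[iu, iv]
    return feat

planes = {k: np.ones((8, 8)) for k in ["xy", "xz", "yz", "xt", "yt", "zt"]}
```

The point of the factorization is storage: six 2D planes of side N hold 6*N^2 values instead of the N^4 needed for a dense 4D grid.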
arXiv Detail & Related papers (2023-05-31T14:38:35Z) - Deep Unrolled Recovery in Sparse Biological Imaging [62.997667081978825]
Deep algorithm unrolling is a model-based approach to develop deep architectures that combine the interpretability of iterative algorithms with the performance gains of supervised deep learning.
This framework is well-suited to applications in biological imaging, where physics-based models exist to describe the measurement process and the information to be recovered is often highly structured.
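Deep unrolling maps each iteration of a classical solver to one network layer. A minimal fixed-parameter ISTA unrolling for sparse recovery, the canonical example of the technique rather than this paper's specific architecture, looks like:

```python
import numpy as np

def soft_threshold(z, theta):
    # Proximal operator of the L1 norm
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def unrolled_ista(y, A, n_layers=10, theta=0.1):
    """Unrolled ISTA for the sparse recovery problem y = A x + noise.

    In learned unrolling (e.g. LISTA) the step size, thresholds, and even
    the matrices become per-layer trainable parameters; here they are
    fixed constants for illustration.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_layers):              # each iteration = one network layer
        grad = A.T @ (A @ x - y)           # data-fidelity gradient step
        x = soft_threshold(x - grad / L, theta / L)
    return x
```

Truncating to a fixed, small number of layers and training the per-layer parameters end-to-end is what turns the iterative algorithm into an interpretable deep architecture.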
arXiv Detail & Related papers (2021-09-28T20:22:44Z) - Adversarial Domain Feature Adaptation for Bronchoscopic Depth Estimation [111.89519571205778]
In this work, we propose an alternative domain-adaptive approach to depth estimation.
Our novel two-step structure first trains a depth estimation network with labeled synthetic images in a supervised manner.
The results of our experiments show that the proposed method improves the network's performance on real images by a considerable margin.
arXiv Detail & Related papers (2021-09-24T08:11:34Z) - NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo [97.07453889070574]
We present a new multi-view depth estimation method that utilizes both conventional SfM reconstruction and learning-based priors.
We show that our proposed framework significantly outperforms state-of-the-art methods on indoor scenes.
arXiv Detail & Related papers (2021-09-02T17:54:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.