Related papers: EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model

URL: http://arxiv.org/abs/2510.22359v1
Date: Sat, 25 Oct 2025 16:39:04 GMT
Title: EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model
Authors: Changhao Zhang, Matthew J. Clarkson, Mobarak I. Hoque
Abstract summary: 3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surgery. In real surgical settings, intrinsic calibration is hindered by sterility constraints and the use of specialized endoscopes with continuous zoom and telescope rotation. In this paper, we integrate intrinsic parameter estimation into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 (DA2) model for joint depth, pose, and intrinsics prediction. Our method is validated on the SCARED and C3VD public datasets, demonstrating superior performance compared to recent state-of-the-art approaches.
Score: 2.8913847481700667
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surgery. A critical yet challenging step in this process is the accurate estimation of the endoscope's intrinsic parameters. In real surgical settings, intrinsic calibration is hindered by sterility constraints and the use of specialized endoscopes with continuous zoom and telescope rotation. Most existing methods for endoscopic 3D reconstruction do not estimate intrinsic parameters, limiting their effectiveness for accurate and reliable reconstruction. In this paper, we integrate intrinsic parameter estimation into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 (DA2) model for joint depth, pose, and intrinsics prediction. We introduce an attention-based pose network and a Weight-Decomposed Low-Rank Adaptation (DoRA) strategy for efficient fine-tuning of DA2. Our method is validated on the SCARED and C3VD public datasets, demonstrating superior performance compared to recent state-of-the-art approaches in self-supervised monocular depth estimation and 3D reconstruction. Code and model weights can be found in the project repository: https://github.com/MOYF-beta/EndoSfM3D.

Related papers

Preoperative-to-intraoperative Liver Registration for Laparoscopic Surgery via Latent-Grounded Correspondence Constraints [51.7011449975586]
Land-Reg is a deformable registration framework that learns latent-grounded 2D-3D landmark correspondences. For rigid registration, Land-Reg embraces a Cross-modal Latent Alignment module. An Uncertainty-enhanced Overlap Landmark Detector with similarity matching is proposed to robustly estimate explicit 2D-3D landmark correspondences.
arXiv Detail & Related papers (2026-03-02T10:44:03Z)
EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images [7.350425834778092]
EndoUFM is an unsupervised monocular depth estimation framework. It enhances the depth estimation performance by leveraging the powerful pre-learned priors. This work contributes to augmenting surgeons' spatial perception during minimally invasive procedures.
arXiv Detail & Related papers (2025-08-25T11:33:05Z)
IXGS-Intraoperative 3D Reconstruction from Sparse, Arbitrarily Posed Real X-rays [1.2721397985664153]
We extend the $R^2$-Gaussian splatting framework to reconstruct consistent 3D volumes under challenging conditions. We introduce an anatomy-guided radiographic standardization step using style transfer, improving visual consistency across views.
arXiv Detail & Related papers (2025-04-20T18:28:13Z)
Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras [41.985581990753765]
We introduce Endo3DAC, a unified framework for endoscopic scene reconstruction. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. Experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T07:49:04Z)
Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy [2.906891207990726]
We introduce a novel fine-tuning strategy for the Depth Anything Model. We integrate it with an intrinsic-based unsupervised monocular depth estimation framework. Our results show that our method achieves state-of-the-art performance while minimizing the number of trainable parameters.
arXiv Detail & Related papers (2024-09-12T03:04:43Z)
GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest image resolution and the most lightweight image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
EndoDAC: Efficient Adapting Foundation Model for Self-Supervised Depth Estimation from Any Endoscopic Camera [12.152362025172915]
We propose Endoscopic Depth Any Camera (EndoDAC) to adapt foundation models to endoscopic scenes. Specifically, we develop the Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks. Our framework is capable of being trained solely on monocular surgical videos from any camera, ensuring minimal training costs.
arXiv Detail & Related papers (2024-05-14T14:55:15Z)
FLex: Joint Pose and Dynamic Radiance Fields Optimization for Stereo Endoscopic Videos [79.50191812646125]
Reconstruction of endoscopic scenes is an important asset for various medical applications, from post-surgery analysis to educational training. We address the challenging setup of a moving endoscope within a highly dynamic environment of deforming tissue. We propose an implicit scene separation into multiple overlapping 4D neural radiance fields (NeRFs) and a progressive optimization scheme jointly optimizing for reconstruction and camera poses from scratch. This improves ease of use and allows reconstruction to scale in time to surgical videos of 5,000 frames and more, an improvement of more than ten times over the state of the art, while being agnostic to external tracking information.
arXiv Detail & Related papers (2024-03-18T19:13:02Z)
Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments [64.59698930334012]
First, we present a multi-camera capture setup consisting of static and head-mounted cameras. Second, we publish a multi-view RGB-D video dataset of ex-vivo spine surgeries, captured in a surgical wet lab and a real operating theatre. Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z)
Adversarial Domain Feature Adaptation for Bronchoscopic Depth Estimation [111.89519571205778]
In this work, we propose an alternative domain-adaptive approach to depth estimation. Our novel two-step structure first trains a depth estimation network with labeled synthetic images in a supervised manner. The results of our experiments show that the proposed method improves the network's performance on real images by a considerable margin.
arXiv Detail & Related papers (2021-09-24T08:11:34Z)
Tattoo tomography: Freehand 3D photoacoustic image reconstruction with an optical pattern [49.240017254888336]
Photoacoustic tomography (PAT) is a novel imaging technique that can resolve both morphological and functional tissue properties. A current drawback is the limited field-of-view provided by the conventionally applied 2D probes. We present a novel approach to 3D reconstruction of PAT data that does not require an external tracking system.
arXiv Detail & Related papers (2020-11-10T09:27:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.