Zero-shot Monocular Metric Depth for Endoscopic Images
- URL: http://arxiv.org/abs/2509.18642v1
- Date: Tue, 23 Sep 2025 04:56:25 GMT
- Title: Zero-shot Monocular Metric Depth for Endoscopic Images
- Authors: Nicolas Toussaint, Emanuele Colleoni, Ricardo Sanchez-Matilla, Joshua Sutcliffe, Vanessa Thompson, Muhammad Asad, Imanol Luengo, Danail Stoyanov
- Abstract summary: We present a benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images. We publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin.
- Score: 9.205799953828896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular relative and metric depth estimation has seen a tremendous boost in the last few years due to the sharp advancements in foundation models and in particular transformer-based networks. As we start to see applications to the domain of endoscopic images, there is still a lack of robust benchmarks and high-quality datasets in that area. This paper addresses these limitations by presenting a comprehensive benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images, providing critical insights into their generalisation and performance in clinical scenarios. Additionally, we introduce and publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks, designed to bridge the gap between synthetic and real-world data. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin. By providing both a benchmark and a synthetic dataset, this work advances the field of depth estimation for endoscopic images and serves as an important resource for future research. Project page, EndoSynth dataset and trained weights are available at https://github.com/TouchSurgery/EndoSynth.
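Benchmarking metric depth models is typically done with a small set of standard error statistics. As a minimal sketch (the abstract does not name the specific metrics used; AbsRel, RMSE, and the δ < 1.25 threshold accuracy are assumed here as the common choices in this literature):

```python
import numpy as np

def depth_metrics(pred, gt, mask=None):
    """Standard monocular metric-depth benchmark statistics.

    pred, gt: depth maps in metres; mask: optional boolean valid-pixel mask.
    """
    if mask is None:
        mask = gt > 0  # ignore pixels without ground truth
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)   # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))   # root mean squared error (m)
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)              # fraction within 25% of GT
    return {"AbsRel": abs_rel, "RMSE": rmse, "delta1": delta1}

# toy example: a prediction that uniformly over-estimates depth by 10%
gt = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = gt * 1.1
m = depth_metrics(pred, gt)
```

In this toy case every pixel is within the 1.25 ratio, so `delta1` is 1.0 while `AbsRel` sits at 0.1, illustrating why benchmarks report several metrics together.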
Related papers
- RealSynCol: a high-fidelity synthetic colon dataset for 3D reconstruction applications [33.26682919703966]
We propose RealSynCol, a highly realistic synthetic dataset designed to replicate the endoscopic environment. The resulting dataset comprises 28,130 frames, paired with ground truth depth maps, optical flow, 3D meshes, and camera trajectories. Results demonstrate that the high realism and variability of RealSynCol significantly enhance generalization performance on clinical images.
arXiv Detail & Related papers (2026-02-09T08:57:37Z) - UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass [83.7071371474926]
UniSH is a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. Our framework bridges strong, disparate priors from scene reconstruction and human mesh recovery (HMR). Our model achieves state-of-the-art performance on human-centric scene reconstruction.
arXiv Detail & Related papers (2026-01-03T16:06:27Z) - Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data [53.040873127309766]
We propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. Our method outperforms existing models on both in-dataset and cross-dataset evaluations.
arXiv Detail & Related papers (2025-09-08T17:58:06Z) - Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning with Vision Foundation Models [0.0]
We present a benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets. Our results show that large-scale models trained on terrestrial data (real or synthetic) are effective in in-air settings, but perform poorly underwater. This study presents a detailed evaluation and visualization of monocular metric depth estimation in underwater scenes.
arXiv Detail & Related papers (2025-07-02T21:06:39Z) - Boosting Zero-shot Stereo Matching using Large-scale Mixed Images Sources in the Real World [8.56549004133167]
Stereo matching methods rely on dense pixel-wise ground truth labels. The scarcity of labeled data and domain gaps between synthetic and real-world images pose notable challenges. We propose a novel framework, BooSTer, that leverages both vision foundation models and large-scale mixed image sources.
arXiv Detail & Related papers (2025-05-13T14:24:38Z) - SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data [78.70620682374624]
We introduce SynFER, a novel framework for synthesizing facial expression image data based on high-level textual descriptions. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique and a pseudo-label generator. Results validate the efficacy of our approach and the synthetic data.
arXiv Detail & Related papers (2024-10-13T14:58:21Z) - Structure-preserving Image Translation for Depth Estimation in Colonoscopy Video [1.0485739694839669]
We propose a pipeline of structure-preserving synthetic-to-real (sim2real) image translation.
This allows us to generate large quantities of realistic-looking synthetic images for supervised depth estimation.
We also propose a dataset of hand-picked sequences from clinical colonoscopies to improve the image translation process.
arXiv Detail & Related papers (2024-08-19T17:02:16Z) - Deep Domain Adaptation: A Sim2Real Neural Approach for Improving Eye-Tracking Systems [80.62854148838359]
Eye image segmentation is a critical step in eye tracking that has great influence over the final gaze estimate.
We use dimensionality-reduction techniques to measure the overlap between the target eye images and synthetic training data.
Our methods result in robust, improved performance when tackling the discrepancy between simulation and real-world data samples.
arXiv Detail & Related papers (2024-03-23T22:32:06Z) - CarPatch: A Synthetic Benchmark for Radiance Field Evaluation on Vehicle Components [77.33782775860028]
We introduce CarPatch, a novel synthetic benchmark of vehicles.
In addition to a set of images annotated with their intrinsic and extrinsic camera parameters, the corresponding depth maps and semantic segmentation masks have been generated for each view.
Global and part-based metrics have been defined and used to evaluate, compare, and better characterize some state-of-the-art techniques.
arXiv Detail & Related papers (2023-07-24T11:59:07Z) - Do More With What You Have: Transferring Depth-Scale from Labeled to Unlabeled Domains [43.16293941978469]
Self-supervised depth estimators result in up-to-scale predictions that are linearly correlated to their absolute depth values across the domain.
We show that aligning the field-of-view of two datasets prior to training results in a common linear relationship for both domains.
We use this observed property to transfer the depth-scale from source datasets that have absolute depth labels to new target datasets that lack these measurements.
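The depth-scale transfer described above rests on a shared linear relationship between up-to-scale predictions and absolute depth. A minimal sketch of that idea, assuming the relation gt ≈ a·pred + b (the arrays and least-squares fit here are illustrative, not the paper's implementation):

```python
import numpy as np

def fit_depth_scale(pred_rel, gt_metric):
    """Least-squares fit of a linear map gt ≈ a * pred + b."""
    A = np.stack([pred_rel, np.ones_like(pred_rel)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, gt_metric, rcond=None)
    return a, b

# source domain: up-to-scale predictions paired with absolute depth labels
pred_src = np.array([0.5, 1.0, 1.5, 2.0])
gt_src = 2.0 * pred_src + 0.1  # hypothetical linear relation in the source

a, b = fit_depth_scale(pred_src, gt_src)

# target domain lacks absolute labels; reuse the fitted source-domain scale
pred_tgt = np.array([0.8, 1.2])
metric_tgt = a * pred_tgt + b
```

Aligning the field-of-view of the two datasets beforehand, as the paper proposes, is what makes a single (a, b) pair plausible across both domains.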
arXiv Detail & Related papers (2023-03-14T07:07:34Z) - Is synthetic data from generative models ready for image recognition? [69.42645602062024]
We study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks.
We showcase the powerfulness and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data for recognition tasks.
arXiv Detail & Related papers (2022-10-14T06:54:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.