EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams
- URL: http://arxiv.org/abs/2512.18159v1
- Date: Sat, 20 Dec 2025 00:53:30 GMT
- Title: EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams
- Authors: Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster, Ipek Oguz
- Abstract summary: This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput.
- Score: 6.300100115696222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth
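A minimal sketch of the streaming design described in the abstract, assuming a PyTorch-style implementation: each frame is processed individually, and a recurrent temporal state carries inter-frame information forward. This is not the released EndoStreamDepth code; the class name StreamingDepthSketch and all parameters are hypothetical, and a GRU cell stands in for the multi-level Mamba temporal modules and the endoscopy-specific transformation described in the paper.
```python
# Hedged sketch (not the authors' implementation) of per-frame streaming depth
# estimation with a recurrent temporal state propagated across frames.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StreamingDepthSketch(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Stand-in single-frame encoder (the paper uses a dedicated
        # single-frame depth network with an endoscopy-specific transformation).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Temporal module: one recurrent cell per spatial location, updated
        # frame by frame (a GRU here stands in for the Mamba temporal modules).
        self.temporal = nn.GRUCell(feat_dim, feat_dim)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_dim, 1, 3, padding=1),
        )

    def forward(self, frame, state=None):
        feat = self.encoder(frame)                      # (B, C, H/4, W/4)
        b, c, h, w = feat.shape
        tokens = feat.permute(0, 2, 3, 1).reshape(b * h * w, c)
        if state is None:
            state = torch.zeros_like(tokens)
        state = self.temporal(tokens, state)            # propagate inter-frame info
        fused = state.reshape(b, h, w, c).permute(0, 3, 1, 2)
        depth = F.softplus(self.decoder(fused))         # positive per-pixel depth
        return depth, state


# Streaming usage: frames arrive one at a time; the state is carried forward.
model = StreamingDepthSketch()
state = None
for _ in range(3):
    frame = torch.rand(1, 3, 64, 64)
    depth, state = model(frame, state)
    state = state.detach()                              # truncate the temporal graph
print(depth.shape)  # torch.Size([1, 1, 64, 64])
```
Carrying (and detaching) the state across frames is what distinguishes this streaming setup from batched-clip inference: each new frame costs a single forward pass while still being conditioned on the frame history.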
Related papers
- Video Depth Propagation [54.523028170425256]
Existing methods rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies. We propose VeloDepth, which effectively leverages an online video pipeline and performs deep feature propagation. Our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency.
arXiv Detail & Related papers (2025-12-11T15:08:37Z) - Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction [3.251946340142663]
A unified framework for monocular endoscopic tissue reconstruction is presented. It integrates scale-aware depth prediction with temporally-constrained perceptual refinement. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework's robustness and superiority over state-of-the-art methods.
arXiv Detail & Related papers (2025-08-15T07:41:17Z) - Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue Boundaries [8.217543444539652]
Real-time acquisition of accurate scene depth is essential for automated robotic minimally invasive surgery. Existing stereo matching methods, designed primarily for natural images, often struggle with endoscopic images due to fuzzy tissue boundaries. We propose RRESM, a real-time stereo matching method tailored for endoscopic images.
arXiv Detail & Related papers (2025-03-02T05:06:52Z) - REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning [0.7499722271664147]
A novel framework is proposed to perform real-time ego-motion tracking for endoscopes. A multi-modal visual feature learning network is proposed to perform relative pose prediction. The absolute pose of the endoscope is calculated based on the relative poses.
arXiv Detail & Related papers (2025-01-30T03:58:41Z) - DD-VNB: A Depth-based Dual-Loop Framework for Real-time Visually Navigated Bronchoscopy [5.8722774441994074]
We propose a Depth-based Dual-Loop framework for real-time Visually Navigated Bronchoscopy (DD-VNB).
The DD-VNB framework integrates two key modules: depth estimation and dual-loop localization.
Experiments on phantom and in-vivo data from patients demonstrate the effectiveness of our framework.
arXiv Detail & Related papers (2024-03-04T02:29:02Z) - OPA-3D: Occlusion-Aware Pixel-Wise Aggregation for Monocular 3D Object
Detection [51.153003057515754]
OPA-3D is a single-stage, end-to-end, Occlusion-Aware Pixel-Wise Aggregation network.
It jointly estimates dense scene depth with depth-bounding box residuals and object bounding boxes.
It outperforms state-of-the-art methods on the main Car category.
arXiv Detail & Related papers (2022-11-02T14:19:13Z) - Self-Supervised Depth Estimation in Laparoscopic Image using 3D
Geometric Consistency [7.902636435901286]
We present M3Depth, a self-supervised depth estimator that leverages the 3D geometric structural information hidden in stereo pairs.
Our method outperforms previous self-supervised approaches on both a public dataset and a newly acquired dataset by a large margin.
arXiv Detail & Related papers (2022-08-17T17:03:48Z) - 3DVNet: Multi-View Depth Prediction and Volumetric Refinement [68.68537312256144]
3DVNet is a novel multi-view stereo (MVS) depth-prediction method.
Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions.
We show that our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics.
arXiv Detail & Related papers (2021-12-01T00:52:42Z) - Deep Two-View Structure-from-Motion Revisited [83.93809929963969]
Two-view structure-from-motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM.
We propose to revisit the problem of deep two-view SfM by leveraging the well-posedness of the classic pipeline.
Our method consists of 1) an optical flow estimation network that predicts dense correspondences between two frames; 2) a normalized pose estimation module that computes relative camera poses from the 2D optical flow correspondences; and 3) a scale-invariant depth estimation network that leverages epipolar geometry to reduce the search space, refine the dense correspondences, and estimate relative depth maps.
arXiv Detail & Related papers (2021-04-01T15:31:20Z) - Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z) - Don't Forget The Past: Recurrent Depth Estimation from Monocular Video [92.84498980104424]
We put three different types of depth estimation into a common framework.
Our method produces a time series of depth maps.
It can be applied to monocular videos only or be combined with different types of sparse depth patterns.
arXiv Detail & Related papers (2020-01-08T16:50:51Z)