4D Monocular Surgical Reconstruction under Arbitrary Camera Motions
- URL: http://arxiv.org/abs/2602.17473v1
- Date: Thu, 19 Feb 2026 15:37:27 GMT
- Title: 4D Monocular Surgical Reconstruction under Arbitrary Camera Motions
- Authors: Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Yirui Li, Hao Liu, Lijun Han, Hesheng Wang, Shing Shin Cheng
- Abstract summary: Local-EndoGS is a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. We show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry.
- Score: 21.36069198688806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.
Related papers
- Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding [54.859943475818234]
We present Motion4D, a novel framework that integrates 2D priors from foundation models into a unified 4D Gaussian Splatting representation.
Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence.
Our method significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis.
arXiv Detail & Related papers (2025-12-03T09:32:56Z)
- EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction [18.43808203690038]
Endoscopic scenarios present unique challenges, including photometric inconsistencies, non-rigid tissue motion, and view-dependent highlights.
Most 3DGS-based methods rely solely on appearance constraints, which are often insufficient for optimizing 3DGS in this context.
We present EndoWave, which incorporates an optical flow-based geometric constraint and a multi-resolution rational wavelet supervision.
arXiv Detail & Related papers (2025-10-27T07:45:17Z)
- Visual Odometry with Transformers [68.453547770334]
We introduce Visual odometry Transformer (VoT), which processes sequences of monocular frames by extracting features.
Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision.
VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster.
arXiv Detail & Related papers (2025-10-02T17:00:14Z)
- EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting [7.7956059927002705]
We introduce optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion.
In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to keyframes.
Our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation.
arXiv Detail & Related papers (2025-06-26T16:06:46Z)
- DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos [52.46386528202226]
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM).
It is the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene.
It achieves performance on par with state-of-the-art monocular video 3D tracking methods.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
- Endo3R: Unified Online Reconstruction from Dynamic Monocular Endoscopic Video [35.241054116681426]
Endo3R is a unified 3D foundation model for online scale-consistent reconstruction from monocular surgical video.
Our model unifies the tasks by predicting globally aligned pointmaps, scale-consistent video depths, and camera parameters without any offline optimization.
arXiv Detail & Related papers (2025-04-04T06:05:22Z)
- Free-DyGS: Camera-Pose-Free Scene Reconstruction for Dynamic Surgical Videos with Gaussian Splatting [17.0449317212827]
We propose a novel framework for fast reconstruction, termed Free-DyGS.
The framework is equipped with a novel Retrospective Deformation Recapitulation (RDR) strategy to preserve the entire-clip deformations throughout the frame-by-frame training scheme.
arXiv Detail & Related papers (2024-09-02T07:28:14Z)
- MultiViPerFrOG: A Globally Optimized Multi-Viewpoint Perception Framework for Camera Motion and Tissue Deformation [18.261678529996104]
We propose a framework that can flexibly integrate the output of low-level perception modules with kinematic and scene-modeling priors.
Overall, our method shows robustness to combined noisy input measures and can process hundreds of points in a few milliseconds.
arXiv Detail & Related papers (2024-08-08T10:55:55Z)
- FLex: Joint Pose and Dynamic Radiance Fields Optimization for Stereo Endoscopic Videos [79.50191812646125]
Reconstruction of endoscopic scenes is an important asset for various medical applications, from post-surgery analysis to educational training.
We address the challenging setup of a moving endoscope within a highly dynamic environment of deforming tissue.
We propose an implicit scene separation into multiple overlapping 4D neural radiance fields (NeRFs) and a progressive optimization scheme jointly optimizing for reconstruction and camera poses from scratch.
This improves ease of use and allows reconstruction to scale in time to surgical videos of 5,000 frames and more, an improvement of more than ten times over the state of the art, while being agnostic to external tracking information.
arXiv Detail & Related papers (2024-03-18T19:13:02Z)
- EndoGS: Deformable Endoscopic Tissues Reconstruction with Gaussian Splatting [20.848027172010358]
We present EndoGS, applying Gaussian Splatting for deformable endoscopic tissue reconstruction.
Our approach incorporates deformation fields to handle dynamic scenes, depth-guided supervision with spatial-temporal weight masks, and surface-aligned regularization terms.
As a result, EndoGS reconstructs and renders high-quality deformable endoscopic tissues from a single-viewpoint video, estimated depth maps, and labeled tool masks.
arXiv Detail & Related papers (2024-01-21T16:14:04Z)
- Unbiased 4D: Monocular 4D Reconstruction with a Neural Deformation Model [76.64071133839862]
Capturing general deforming scenes from monocular RGB video is crucial for many computer graphics and vision applications.
Our method, Ub4D, handles large deformations, performs shape completion in occluded regions, and can operate on monocular RGB videos directly by using differentiable volume rendering.
Results on our new dataset, which will be made publicly available, demonstrate a clear improvement over the state of the art in terms of surface reconstruction accuracy and robustness to large deformations.
arXiv Detail & Related papers (2022-06-16T17:59:54Z)
- MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios.
It is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.