Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery
- URL: http://arxiv.org/abs/2505.17677v2
- Date: Fri, 30 May 2025 04:41:21 GMT
- Title: Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery
- Authors: Ming Hu, Zhengdi Yu, Feilong Tang, Kaiwen Chen, Yulong Li, Imran Razzak, Junjun He, Tolga Birdal, Kaijing Zhou, Zongyuan Ge
- Abstract summary: This work introduces OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery. It comprises 41 sequences from 40 surgeons, totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. Building upon OphNet-3D, we establish two challenging benchmarks: bimanual hand pose estimation and hand-instrument interaction reconstruction.
- Score: 38.9015512099686
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion priors with cross-view geometric consistency and biomechanical constraints, and collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks, bimanual hand pose estimation and hand-instrument interaction reconstruction, and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand-two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and a collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2 mm in Mean Per Joint Position Error (MPJPE) for hand reconstruction and up to 23% in ADD-S for instrument reconstruction.
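For reference, the two evaluation metrics quoted above (MPJPE for hand joints, ADD-S for 6-DoF instrument poses) and a weak-perspective projection can be written in a few lines. The sketch below is a minimal, assumed NumPy implementation; function names, array shapes, and units are illustrative and not taken from the OphNet-3D code.

```python
# Minimal sketch (not the paper's implementation) of the metrics and the
# weak-perspective camera model mentioned in the abstract.
import numpy as np

def mpjpe(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """Mean Per Joint Position Error: mean Euclidean distance over joints.

    pred_joints, gt_joints: (J, 3) arrays of 3D hand joints (e.g. in mm).
    """
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())

def add_s(pred_R, pred_t, gt_R, gt_t, model_points) -> float:
    """ADD-S: symmetric average distance of model points for a 6-DoF pose.

    For each model point under the predicted pose, take the distance to the
    closest model point under the ground-truth pose, then average.
    model_points: (N, 3); R: (3, 3); t: (3,).
    """
    pred_pts = model_points @ pred_R.T + pred_t   # (N, 3)
    gt_pts = model_points @ gt_R.T + gt_t         # (N, 3)
    # pairwise distances between predicted and ground-truth point sets
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())

def weak_perspective_project(points_3d, scale, trans_xy):
    """Weak-perspective camera: orthographic projection plus global scale/shift.

    points_3d: (N, 3); scale: scalar; trans_xy: (2,) image-plane translation.
    """
    return scale * points_3d[:, :2] + trans_xy
```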
Related papers
- MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation [28.75149480374178]
MEgoHand is a framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. It achieves substantial reductions in wrist translation error and joint rotation error, highlighting its capacity to accurately model fine-grained hand joint structures.
arXiv Detail & Related papers (2025-05-22T12:37:47Z) - VM-BHINet:Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image [13.009696075460521]
Vision Mamba Bimanual Hand Interaction Network (VM-BHINet) introduces state space models (SSMs) into hand reconstruction to enhance interaction modeling. The core component, the Vision Mamba Interaction Feature Extraction Block (VM-IFEBlock), combines SSMs with local and global feature operations. Experiments on the InterHand2.6M dataset show that VM-BHINet reduces Mean Per Joint Position Error (MPJPE) and Mean Per Vertex Position Error (MPVPE) by 2-3%.
arXiv Detail & Related papers (2025-04-20T13:54:22Z) - Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with these alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - Betsu-Betsu: Multi-View Separable 3D Reconstruction of Two Interacting Objects [67.96148051569993]
This paper introduces a new neuro-implicit method that can reconstruct the geometry and appearance of two objects undergoing close interactions while disjoining both in 3D. The framework is end-to-end trainable and supervised using a novel alpha-blending regularisation. We introduce a new dataset consisting of close interactions between a human and an object, and also evaluate on two scenes of humans performing martial arts.
arXiv Detail & Related papers (2025-02-19T18:59:56Z) - WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild [53.288327629960364]
We present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization model and a high-fidelity transformer-based 3D hand reconstruction model. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks.
arXiv Detail & Related papers (2024-09-18T18:46:51Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and
Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - Syn3DWound: A Synthetic Dataset for 3D Wound Bed Analysis [28.960666848416274]
This paper introduces Syn3DWound, an open-source dataset of high-fidelity simulated wounds with 2D and 3D annotations.
We propose a benchmarking framework for automated 3D morphometry analysis and 2D/3D wound segmentation.
arXiv Detail & Related papers (2023-11-27T13:59:53Z) - MOHO: Learning Single-view Hand-held Object Reconstruction with
Multi-view Occlusion-Aware Supervision [75.38953287579616]
We present a novel framework to exploit Multi-view Occlusion-aware supervision from hand-object videos for Hand-held Object reconstruction.
We tackle two predominant challenges in this setting: hand-induced occlusion and the object's self-occlusion.
Experiments on the HO3D and DexYCB datasets demonstrate that 2D-supervised MOHO outperforms 3D-supervised methods by a large margin.
arXiv Detail & Related papers (2023-10-18T03:57:06Z) - Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse
Problems [7.074380879971194]
We propose a novel two-and-a-half order score-based model (TOSM) for 3D volumetric reconstruction.
During the training phase, our TOSM learns data distributions in 2D space, which reduces the complexity of training.
In the reconstruction phase, the TOSM updates the data distribution in 3D space, utilizing complementary scores along three directions.
arXiv Detail & Related papers (2023-08-16T17:07:40Z) - Joint Hand-object 3D Reconstruction from a Single Image with
Cross-branch Feature Fusion [78.98074380040838]
We propose to consider hand and object jointly in feature space and explore the reciprocity of the two branches.
We employ an auxiliary depth estimation module to augment the input RGB image with the estimated depth map, as sketched below.
Our approach significantly outperforms existing methods in terms of object reconstruction accuracy.
arXiv Detail & Related papers (2020-06-28T09:50:25Z)