Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining
- URL: http://arxiv.org/abs/2602.20500v1
- Date: Tue, 24 Feb 2026 02:56:39 GMT
- Title: Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining
- Authors: Keyu Zhou, Peisen Xu, Yahao Wu, Jiming Chen, Gaofeng Li, Shunlei Li,
- Abstract summary: We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events and structured as attributed event graphs. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands.
- Score: 15.995867664955348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. Mining these graphs yields a compact set of reusable camera-handling strategy primitives, which provide structured supervision for learning. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands, executed by an IBVS-RCM controller under strict safety constraints; optional speech input enables intuitive human-in-the-loop conditioning. On a surgeon-annotated dataset, event parsing achieves reliable temporal localization (F1-score 0.86), and the mined strategies show strong semantic alignment with expert interpretation (cluster purity 0.81). Extensive ex vivo experiments on silicone phantoms and porcine tissues demonstrate that the proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.
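The abstract describes a two-level design: offline mining of attributed event graphs into reusable strategy primitives, and an online VLM that emits a strategy label plus discrete image-space motion commands executed by an IBVS-RCM controller. Below is a minimal, hedged sketch of that offline/online split; every data structure and the command-to-velocity mapping are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code) of the framework described in the abstract:
# offline, camera-relevant events are stored as an attributed event graph; online, a
# (hypothetical) VLM output is mapped to an image-space velocity for visual servoing.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class Event:                      # one camera-relevant temporal event
    kind: str                     # e.g. "interaction", "distance_deviation", "view_degradation"
    t_start: float
    t_end: float
    attrs: Dict[str, float] = field(default_factory=dict)

@dataclass
class EventGraph:                 # attributed event graph for one surgical video
    events: List[Event] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)   # temporal/causal links

    def add(self, e: Event) -> int:
        self.events.append(e)
        if len(self.events) > 1:                 # naive temporal chaining for illustration
            self.edges.append((len(self.events) - 2, len(self.events) - 1))
        return len(self.events) - 1

g = EventGraph()
g.add(Event("interaction", 12.0, 14.5, {"tool_tissue_dist": 0.8}))

# Online loop (assumed command set): map a discrete VLM command to a pixel-space velocity,
# which an IBVS-RCM controller would convert into joint motion under an RCM constraint.
COMMANDS = {"pan_left": (-1, 0), "pan_right": (1, 0), "tilt_up": (0, -1), "tilt_down": (0, 1), "hold": (0, 0)}

def command_to_pixel_velocity(command: str, gain: float = 20.0) -> np.ndarray:
    """Turn a discrete motion command into a desired pixel-space velocity (px/s)."""
    dx, dy = COMMANDS.get(command, (0, 0))
    return gain * np.array([dx, dy], dtype=float)

print(command_to_pixel_velocity("pan_left"))
```

The discrete command interface keeps the VLM's output space small and interpretable, leaving continuous, safety-constrained motion entirely to the low-level controller.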
Related papers
- UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models [54.564740558030245]
We present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. We also introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting.
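As a rough illustration of what a "time-aware positional encoding warp" could mean, the sketch below shifts the phase of a standard sinusoidal encoding by each frame's timestamp, so revisited content at a later time maps to a predictably warped code. This is an assumption for illustration, not UCM's actual mechanism.

```python
# Hedged sketch: warp spatial positions by a time-dependent offset before encoding.
import numpy as np

def sinusoidal_encoding(pos: np.ndarray, dim: int) -> np.ndarray:
    i = np.arange(dim // 2)
    freq = 1.0 / (10000 ** (2 * i / dim))
    angles = pos[:, None] * freq[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def time_warped_encoding(pos: np.ndarray, t: np.ndarray, dim: int, alpha: float = 0.1) -> np.ndarray:
    # alpha controls how strongly the timestamp warps the positional code (illustrative)
    return sinusoidal_encoding(pos + alpha * t, dim)

codes = time_warped_encoding(np.arange(8, dtype=float), np.full(8, 3.0), dim=16)
print(codes.shape)   # (8, 16)
```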
arXiv Detail & Related papers (2026-02-26T12:54:46Z) - SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement [45.37105164372227]
SurgAtt-Tracker is a holistic framework that robustly tracks surgical attention. Experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker achieves consistently state-of-the-art performance.
arXiv Detail & Related papers (2026-02-24T07:30:51Z) - Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching [7.674595072442547]
This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences. Our pipeline surpasses state-of-the-art methodologies on the SCARED datasets, with improved matching precision and lower epipolar error.
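When ground-truth inlier correspondences are available (e.g. from novel-view synthesis), a common way to adapt descriptors is a contrastive objective over matched pairs. The snippet below is an illustrative InfoNCE-style loss, not the paper's code; all names are assumptions.

```python
# Hedged sketch: matched descriptor pairs are positives, all other pairings negatives.
import torch
import torch.nn.functional as F

def infonce_matching_loss(desc_a: torch.Tensor, desc_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """desc_a[i] and desc_b[i] are descriptors of the i-th ground-truth correspondence."""
    a = F.normalize(desc_a, dim=-1)
    b = F.normalize(desc_b, dim=-1)
    logits = a @ b.t() / temperature              # similarity of every cross pairing
    targets = torch.arange(a.shape[0])            # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = infonce_matching_loss(torch.randn(32, 128), torch.randn(32, 128))
print(float(loss))
```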
arXiv Detail & Related papers (2025-12-11T07:44:00Z) - EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control [10.426745597034204]
We introduce EndoControlMag, a training-free framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation, and a hierarchical tissue-aware dual-mask control scheme. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios.
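Periodic Reference Resetting, as summarized, splits the video into short overlapping clips and resets the reference frame at each clip so drift cannot accumulate across the whole sequence. The sketch below shows only that clip/reference bookkeeping under assumed parameters; the magnification step itself is a placeholder.

```python
# Hedged sketch of periodic reference resetting; magnify() is a stand-in, not the paper's method.
import numpy as np

def magnify(frame: np.ndarray, reference: np.ndarray, factor: float = 5.0) -> np.ndarray:
    # placeholder: amplify the deviation of the frame from the clip-local reference
    return np.clip(reference + factor * (frame - reference), 0.0, 1.0)

def periodic_reference_resetting(frames: list, clip_len: int = 16, overlap: int = 4) -> list:
    out, start = [None] * len(frames), 0
    while start < len(frames):
        reference = frames[start]                       # reset the reference at each clip start
        for idx in range(start, min(start + clip_len, len(frames))):
            out[idx] = magnify(frames[idx], reference)  # overlap region keeps the later clip's result
        start += clip_len - overlap
    return out

video = [np.random.rand(8, 8) for _ in range(40)]
print(len(periodic_reference_resetting(video)))   # 40
```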
arXiv Detail & Related papers (2025-07-21T06:47:44Z) - Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z) - Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers [6.262161803642583]
We propose a novel approach to learn procedural features from a very large data cohort of over 16 million interventional X-ray frames.
Our approach is based on a masked image modeling technique that leverages frame-based reconstruction to learn fine inter-frame temporal correspondences.
Experiments show that our method achieves 66.31% reduction in maximum tracking error against reference solutions.
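The summary describes masked image modeling with frame-based reconstruction used to learn inter-frame temporal correspondences. The sketch below is a toy version of that idea: patches of a target frame are masked and reconstructed conditioned on a neighbouring frame. The tiny encoder/decoder and the conditioning scheme are assumptions, not the paper's architecture.

```python
# Hedged sketch of a masked-image-modeling pretraining step across two frames.
import torch
import torch.nn as nn

def random_patch_mask(n_patches: int, mask_ratio: float = 0.75) -> torch.Tensor:
    n_masked = int(n_patches * mask_ratio)
    perm = torch.randperm(n_patches)
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[perm[:n_masked]] = True
    return mask

class TinyCrossFrameMAE(nn.Module):
    def __init__(self, patch_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.encode = nn.Linear(patch_dim, hidden)
        self.decode = nn.Linear(hidden, patch_dim)

    def forward(self, ref_patches, tgt_patches, mask):
        visible = torch.where(mask[:, None], torch.zeros_like(tgt_patches), tgt_patches)
        latent = self.encode(visible + ref_patches)        # crude cross-frame conditioning
        recon = self.decode(latent)
        return ((recon - tgt_patches) ** 2)[mask].mean()   # loss only on masked patches

model = TinyCrossFrameMAE()
mask = random_patch_mask(64)
print(float(model(torch.randn(64, 256), torch.randn(64, 256), mask)))
```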
arXiv Detail & Related papers (2024-05-02T10:18:22Z) - FLex: Joint Pose and Dynamic Radiance Fields Optimization for Stereo Endoscopic Videos [79.50191812646125]
Reconstruction of endoscopic scenes is an important asset for various medical applications, from post-surgery analysis to educational training.
We address the challenging setup of a moving endoscope within a highly dynamic environment of deforming tissue.
We propose an implicit scene separation into multiple overlapping 4D neural radiance fields (NeRFs) and a progressive optimization scheme jointly optimizing for reconstruction and camera poses from scratch.
This improves the ease-of-use and allows reconstruction capabilities to scale in time, processing surgical videos of 5,000 frames and more; an improvement of more than ten times compared to the state of the art, while being agnostic to external tracking information.
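One ingredient named above is the implicit separation of a long sequence into multiple overlapping local radiance fields. The snippet below sketches only the bookkeeping of that idea under assumed segment lengths: each timestamp gets per-segment blend weights, flat inside a segment and ramped in the overlaps. It is not FLex's exact scheme.

```python
# Hedged sketch: overlapping temporal segments with linear blending in the overlap regions.
import numpy as np

def segment_weights(t: float, n_segments: int, seg_len: float, overlap: float) -> np.ndarray:
    """Return per-segment blend weights for timestamp t (normalized where any segment covers t)."""
    w = np.zeros(n_segments)
    step = seg_len - overlap
    for k in range(n_segments):
        start, end = k * step, k * step + seg_len
        if start <= t <= end:
            ramp_in = np.clip((t - start) / overlap, 0.0, 1.0) if k > 0 else 1.0
            ramp_out = np.clip((end - t) / overlap, 0.0, 1.0) if k < n_segments - 1 else 1.0
            w[k] = min(ramp_in, ramp_out)
    s = w.sum()
    return w / s if s > 0 else w

print(segment_weights(t=95.0, n_segments=5, seg_len=100.0, overlap=20.0))   # [0.25 0.75 0. 0. 0.]
```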
arXiv Detail & Related papers (2024-03-18T19:13:02Z) - LoViT: Long Video Transformer for Surgical Phase Recognition [59.06812739441785]
We present a two-stage method, called Long Video Transformer (LoViT) for fusing short- and long-term temporal information.
Our approach outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets consistently.
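The summary names a two-stage fusion of short- and long-term temporal information for phase recognition. As a loose, hedged illustration only, the sketch below combines a short sliding-window average with a running exponential memory through a linear head; this is a simplification, not LoViT's transformer architecture.

```python
# Hedged sketch of per-frame phase logits from short-term and long-term context features.
import torch
import torch.nn as nn

class ShortLongFusionHead(nn.Module):
    def __init__(self, feat_dim: int = 64, n_phases: int = 7, window: int = 8, decay: float = 0.99):
        super().__init__()
        self.window, self.decay = window, decay
        self.classify = nn.Linear(2 * feat_dim, n_phases)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:      # feats: (T, feat_dim)
        T, d = feats.shape
        logits, memory = [], torch.zeros(d)
        for t in range(T):
            short = feats[max(0, t - self.window + 1): t + 1].mean(dim=0)   # short-term context
            memory = self.decay * memory + (1 - self.decay) * feats[t]      # long-term context
            logits.append(self.classify(torch.cat([short, memory])))
        return torch.stack(logits)

print(ShortLongFusionHead()(torch.randn(20, 64)).shape)   # torch.Size([20, 7])
```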
arXiv Detail & Related papers (2023-05-15T20:06:14Z) - Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments [64.59698930334012]
First, we present a multi-camera capture setup consisting of static and head-mounted cameras. Second, we publish a multi-view RGB-D video dataset of ex-vivo spine surgeries, captured in a surgical wet lab and a real operating theatre. Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z) - Real-time Surgical Environment Enhancement for Robot-Assisted Minimally Invasive Surgery Based on Super-Resolution [18.696539908774454]
We propose a Generative Adversarial Network (GAN)-based video super-resolution method to construct a framework for automatic zooming ratio adjustment.
It can provide automatic real-time zooming for high-quality visualization of the Region Of Interest (ROI) during the surgical operation.
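The automatic zooming-ratio idea can be pictured as choosing the zoom that makes the detected ROI fill a target fraction of the frame, with super-resolution restoring detail afterwards. The sketch below shows only that ratio computation; parameter names and the target fraction are assumptions, not the paper's values.

```python
# Hedged sketch of automatic zoom-ratio selection from an ROI bounding box.
def zoom_ratio(roi_w: float, roi_h: float, frame_w: float, frame_h: float,
               target_fraction: float = 0.5, max_zoom: float = 4.0) -> float:
    """Return the zoom factor that scales the ROI toward the target fraction of the frame."""
    frac_w, frac_h = roi_w / frame_w, roi_h / frame_h
    current = max(frac_w, frac_h)                 # limited by the larger ROI dimension
    if current <= 0:
        return 1.0
    return min(max(target_fraction / current, 1.0), max_zoom)

print(zoom_ratio(roi_w=160, roi_h=120, frame_w=1280, frame_h=720))   # 3.0
```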
arXiv Detail & Related papers (2020-11-08T15:40:05Z)