SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation
- URL: http://arxiv.org/abs/2509.10748v1
- Date: Fri, 12 Sep 2025 23:36:52 GMT
- Title: SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation
- Authors: Jecia Z. Y. Mao, Francis X Creighton, Russell H Taylor, Manish Sahu
- Abstract summary: We introduce a speech-guided collaborative perception framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set VFMs. A key component of this framework is a collaborative perception agent, which generates top candidates of VFM-generated segmentations. Instruments themselves then serve as interactive pointers to label additional elements of the surgical scene.
- Score: 4.97436124491469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate segmentation and tracking of relevant elements of the surgical scene are crucial to enable context-aware intraoperative assistance and decision making. Current solutions remain tethered to domain-specific, supervised models that rely on labeled data and require domain-specific data to adapt to new surgical scenarios and to label categories beyond those predefined. Recent advances in prompt-driven vision foundation models (VFM) have enabled open-set, zero-shot segmentation across heterogeneous medical images. However, the dependence of these models on manual visual or textual cues restricts their deployment in intraoperative surgical settings. We introduce a speech-guided collaborative perception (SCOPE) framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set VFMs to support on-the-fly segmentation, labeling and tracking of surgical instruments and anatomy in intraoperative video streams. A key component of this framework is a collaborative perception agent, which generates top candidates of VFM-generated segmentation and incorporates intuitive speech feedback from clinicians to guide the segmentation of surgical instruments in a natural human-machine collaboration paradigm. Afterwards, instruments themselves serve as interactive pointers to label additional elements of the surgical scene. We evaluated our proposed framework on a subset of the publicly available Cataract1k dataset and an in-house ex-vivo skull-base dataset to demonstrate its potential to generate on-the-fly segmentation and tracking of the surgical scene. Furthermore, we demonstrate its dynamic capabilities through a live mock ex-vivo experiment. This human-AI collaboration paradigm showcases the potential of developing adaptable, hands-free, surgeon-centric tools for dynamic operating-room environments.
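The abstract outlines a loop in which an open-set VFM proposes candidate masks, an LLM interprets the surgeon's spoken feedback to select or relabel a candidate, and the confirmed instrument mask then seeds tracking and can act as a pointer for labeling nearby anatomy. Below is a minimal sketch of how one round of such a loop might be wired together; every class and method name (MaskCandidate, segmenter.propose, recognizer.listen, llm.interpret, tracker.initialize) is a hypothetical placeholder for illustration, not the authors' released code.

```python
from dataclasses import dataclass


@dataclass
class MaskCandidate:
    """One candidate mask proposed by the open-set VFM (hypothetical structure)."""
    mask_id: int
    label: str
    score: float


def collaborative_segmentation_round(frame, recognizer, segmenter, llm, tracker):
    """One interaction round: the VFM proposes candidates, the clinician's spoken
    feedback (interpreted by an LLM) selects/relabels one, and the confirmed mask
    seeds the tracker for subsequent frames. All injected objects are hypothetical."""
    # 1. Open-set VFM proposes the top-k candidate masks for the current frame.
    candidates = segmenter.propose(frame, top_k=3)

    # 2. Clinician speaks a correction, e.g. "the upper one is the suction tip".
    utterance = recognizer.listen()

    # 3. LLM maps the free-form utterance to a structured decision,
    #    e.g. {"accept": 1, "label": "suction tip"}.
    decision = llm.interpret(
        utterance=utterance,
        candidates=[(c.mask_id, c.label, c.score) for c in candidates],
    )

    # 4. The confirmed mask initializes tracking; the tracked instrument tip can
    #    later act as an interactive pointer to label nearby anatomy.
    chosen = next(c for c in candidates if c.mask_id == decision["accept"])
    chosen.label = decision.get("label", chosen.label)
    tracker.initialize(frame, chosen)
    return chosen
```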
Related papers
- GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation [1.9981885081131854]
We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities.
arXiv Detail & Related papers (2026-03-01T13:49:53Z) - VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models [16.299786004060863]
IR-SIS is an iterative refinement system for surgical image segmentation that accepts natural language descriptions. The system supports clinician-in-the-loop interaction through natural language feedback. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
arXiv Detail & Related papers (2026-02-09T22:36:36Z) - Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion [54.359489807885616]
SurgRef is a motion-guided framework that grounds free-form language expressions in instrument motion rather than in instrument appearance. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense temporal masks and rich motion expressions.
arXiv Detail & Related papers (2026-01-18T02:14:08Z) - SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding [8.20483591990742]
We present SurgMLLMBench, a unified benchmark for developing and evaluating interactive multimodal large language models. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains. It achieves consistent performance across domains and generalizes effectively to unseen datasets.
arXiv Detail & Related papers (2025-11-26T12:44:51Z) - Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z) - Probabilistic Task Parameterization of Tool-Tissue Interaction via Sparse Landmarks Tracking in Robotic Surgery [5.075735148466963]
Models of tool-tissue interactions in robotic surgery require precise tracking of deformable tissues and integration of surgical domain knowledge. We propose a framework combining keypoint tracking and probabilistic modeling that propagates expert-annotated landmarks across endoscopic frames.
arXiv Detail & Related papers (2025-04-14T21:28:48Z) - EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z) - LIMIS: Towards Language-based Interactive Medical Image Segmentation [58.553786162527686]
LIMIS is the first purely language-based interactive medical image segmentation model.
We adapt Grounded SAM to the medical domain and design a language-based model interaction strategy.
We evaluate LIMIS on three publicly available medical datasets in terms of performance and usability.
arXiv Detail & Related papers (2024-10-22T12:13:47Z) - Hypergraph-Transformer (HGT) for Interactive Event Prediction in Laparoscopic and Robotic Surgery [47.47211257890948]
We propose a predictive neural network that is capable of understanding and predicting critical interactive aspects of surgical workflow from intra-abdominal video. We verify our approach on established surgical datasets and applications, including the detection and prediction of action triplets. Our results demonstrate the superiority of our approach compared to unstructured alternatives.
arXiv Detail & Related papers (2024-02-03T00:58:05Z) - Pixel-Wise Recognition for Holistic Surgical Scene Understanding [33.40319680006502]
This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies dataset. Our benchmark models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model.
arXiv Detail & Related papers (2024-01-20T09:09:52Z) - SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge [72.97934765570069]
We release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP).
The aim of the challenge is to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain.
A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation.
arXiv Detail & Related papers (2023-12-31T13:32:18Z) - FUN-SIS: a Fully UNsupervised approach for Surgical Instrument Segmentation [16.881624842773604]
We present FUN-SIS, a Fully UNsupervised approach for binary Surgical Instrument Segmentation.
We train a per-frame segmentation model on completely unlabelled endoscopic videos by relying on implicit motion information and instrument shape-priors (a simplified motion-based pseudo-labelling sketch follows this list).
The obtained fully-unsupervised results for surgical instrument segmentation are almost on par with those of fully-supervised state-of-the-art approaches.
arXiv Detail & Related papers (2022-02-16T15:32:02Z) - Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims at providing a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
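Since FUN-SIS (listed above) relies on implicit motion cues from unlabelled video, one simplistic way to picture that idea is to threshold dense optical-flow magnitude into rough instrument pseudo-masks and clean them with a crude shape prior. The sketch below is an illustrative simplification using OpenCV, not the paper's actual training pipeline; in the full approach, learned shape priors and a per-frame segmentation network trained on such pseudo-labels replace these morphology heuristics.

```python
import cv2
import numpy as np


def motion_pseudo_mask(prev_gray, curr_gray, flow_thresh=2.0):
    """Derive a rough instrument pseudo-mask from two consecutive grayscale
    frames by thresholding dense optical-flow magnitude (illustrative only)."""
    # Dense Farneback optical flow between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
    )

    # Pixels moving faster than the threshold become foreground candidates.
    magnitude = np.linalg.norm(flow, axis=2)
    mask = (magnitude > flow_thresh).astype(np.uint8)

    # Morphological cleanup stands in for a very crude shape prior
    # (instruments are compact, elongated blobs rather than speckle noise).
    kernel = np.ones((7, 7), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask  # candidate pseudo-label for training a per-frame segmenter
```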