Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion
- URL: http://arxiv.org/abs/2601.12224v1
- Date: Sun, 18 Jan 2026 02:14:08 GMT
- Title: Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion
- Authors: Meng Wei, Kun Yuan, Shi Li, Yue Zhou, Long Bai, Nassir Navab, Hongliang Ren, Hong Joo Lee, Tom Vercauteren, Nicolas Padoy
- Abstract summary: SurgRef is a motion-guided framework that grounds free-form language expressions in instrument motion, capturing how instruments move rather than what they look like. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions.
- Score: 54.359489807885616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enabling intuitive, language-driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, the task of referring segmentation, localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, with existing approaches struggling to generalize due to reliance on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion-guided framework that grounds free-form language expressions in instrument motion, capturing how tools move and interact across time, rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions. SurgRef achieves state-of-the-art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language-driven surgical video segmentation.
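Since the abstract only describes the approach at a high level, here is a minimal, illustrative PyTorch rendering of the motion-guided grounding idea: optical-flow-based motion features are fused with appearance features, and an embedding of the referring expression attends over them to predict a per-pixel mask. Every module, dimension, and design choice below is an assumption for illustration, not the authors' architecture.

```python
# Minimal, illustrative PyTorch sketch of motion-guided referring
# segmentation in the spirit of SurgRef. Every name, dimension, and
# design choice is an assumption for illustration only; it is NOT
# the authors' released architecture.
import torch
import torch.nn as nn


class MotionGuidedReferringSeg(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Project appearance (RGB) and motion (e.g., optical-flow) inputs
        # into a shared embedding space.
        self.appearance_proj = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.motion_proj = nn.Conv2d(2, dim, kernel_size=3, padding=1)
        # Stand-in for a pretrained language encoder's sentence embedding.
        self.text_proj = nn.Linear(512, dim)
        # The language query attends over fused visuo-motion tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, frames, flow, text_emb):
        # frames: (B, 3, H, W) RGB; flow: (B, 2, H, W) optical flow;
        # text_emb: (B, 512) embedding of the referring expression.
        vis = self.appearance_proj(frames) + self.motion_proj(flow)
        b, c, h, w = vis.shape
        tokens = vis.flatten(2).transpose(1, 2)        # (B, H*W, C)
        query = self.text_proj(text_emb).unsqueeze(1)  # (B, 1, C)
        attended, _ = self.cross_attn(query, tokens, tokens)
        # Modulate spatial features by the language-conditioned vector,
        # then predict per-pixel mask logits.
        gated = vis * attended.transpose(1, 2).reshape(b, c, 1, 1)
        return self.mask_head(gated)                   # (B, 1, H, W)


if __name__ == "__main__":
    model = MotionGuidedReferringSeg()
    mask = model(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64),
                 torch.randn(1, 512))
    print(mask.shape)  # torch.Size([1, 1, 64, 64])
```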
Related papers
- GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation [1.9981885081131854]
We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities.
arXiv Detail & Related papers (2026-03-01T13:49:53Z)
- Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes [0.5146940511526402]
This work aims to enhance surgical scene representations by integrating 3D acoustic information. We propose a novel framework for generating 4D audio-visual representations of surgical scenes. The proposed framework enables richer contextual understanding and provides a foundation for future intelligent surgical systems.
arXiv Detail & Related papers (2025-10-28T11:55:45Z)
- SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation [4.97436124491469]
We introduce a speech-guided collaborative perception framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set vision foundation models (VFMs). A key component of this framework is a collaborative perception agent, which generates top candidates from the VFM-generated segmentations. Instruments themselves serve as interactive pointers for labeling additional elements of the surgical scene.
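As a toy illustration of the candidate-selection step just described, the self-contained sketch below has an open-set segmenter propose labeled mask candidates and picks the one matching the spoken request. The LLM is stubbed with simple token overlap, and all names and scores are invented for illustration.

```python
# Toy sketch of SCOPE-style candidate selection: a VFM proposes labeled
# mask candidates and a language model picks the one matching the spoken
# request. Here the LLM is replaced by token overlap; everything below
# (labels, scores, threshold-free ranking) is an assumption.
from dataclasses import dataclass


@dataclass
class MaskCandidate:
    label: str      # open-vocabulary label from the VFM
    score: float    # VFM confidence
    mask_id: int    # handle to the actual segmentation mask


def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def select_candidate(transcript: str, candidates):
    # Rank by agreement between the spoken request and the candidate
    # label, breaking ties with segmenter confidence.
    return max(candidates, key=lambda c: (token_overlap(transcript, c.label), c.score))


candidates = [
    MaskCandidate("bipolar forceps", 0.91, mask_id=0),
    MaskCandidate("suction tube", 0.88, mask_id=1),
]
best = select_candidate("highlight the forceps on the left", candidates)
print(best.label, best.mask_id)  # bipolar forceps 0
```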
arXiv Detail & Related papers (2025-09-12T23:36:52Z)
- SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting [45.16104996137126]
We present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction of both texture and semantic features. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods.
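Below is a loose, self-contained sketch of what text-promptable querying over per-Gaussian semantic features might look like. The actual SurgTPGS method (deformation tracking, differentiable rendering) is far richer; every shape, feature, and threshold here is an assumption.

```python
# Illustrative sketch of text-promptable selection over per-Gaussian
# semantic features, loosely inspired by the SurgTPGS description.
# All shapes and the similarity threshold are assumptions.
import torch
import torch.nn.functional as F

num_gaussians, feat_dim = 10_000, 64
# Stand-ins for learned per-Gaussian semantic features and a text
# encoder's embedding of the prompt.
gaussian_feats = F.normalize(torch.randn(num_gaussians, feat_dim), dim=-1)
text_feat = F.normalize(torch.randn(feat_dim), dim=-1)

# Cosine similarity between the prompt and every Gaussian's semantic
# feature; Gaussians above an assumed threshold form the queried region.
sim = gaussian_feats @ text_feat
selected = sim > 0.2
print(f"{int(selected.sum())} Gaussians match the text prompt")
```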
arXiv Detail & Related papers (2025-06-29T15:55:01Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [67.8359850515282]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition [51.222684687924215]
HecVL is a novel hierarchical video-language pretraining approach for building a generalist surgical model. By disentangling the embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. We show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
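The summary above suggests per-level embedding spaces trained contrastively. Below is a minimal sketch of that idea, with a separate projection head per hierarchy level and a symmetric InfoNCE loss at each; the level names, dimensions, and equal loss weighting are assumptions, not HecVL's exact recipe.

```python
# Minimal sketch of hierarchical video-language contrastive pretraining
# in the spirit of HecVL: separate projection heads give each hierarchy
# level its own ("disentangled") embedding space, each trained with a
# symmetric InfoNCE loss. Dimensions and level names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(v, t, temperature=0.07):
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0))
    # Symmetric loss: video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))


dim = 256
levels = ["clip", "phase", "video"]
video_heads = nn.ModuleDict({l: nn.Linear(512, dim) for l in levels})
text_heads = nn.ModuleDict({l: nn.Linear(512, dim) for l in levels})

# Stand-ins for backbone features of 8 paired video/text samples.
video_feat, text_feat = torch.randn(8, 512), torch.randn(8, 512)
# Each level projects the same backbone features through its own head
# before the contrastive loss, keeping the spaces disentangled.
loss = sum(info_nce(video_heads[l](video_feat), text_heads[l](text_feat))
           for l in levels)
print(loss.item())
```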
arXiv Detail & Related papers (2024-05-16T13:14:43Z)
- Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
- Text Promptable Surgical Instrument Segmentation with Vision-Language Models [16.203166812021045]
We propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with the diversity and differentiation of surgical instruments.
We leverage pretrained image and text encoders as our model backbone and design a text promptable mask decoder.
Experiments on several surgical instrument segmentation datasets demonstrate our model's superior performance and promising generalization capability.
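As a rough illustration of the pretrained-encoder-plus-text-promptable-mask-decoder design described above, the sketch below compares dense image features against a text embedding to produce per-pixel mask logits. The encoders are stubbed with random inputs, and all shapes are assumptions rather than the paper's configuration.

```python
# Minimal sketch of a text-promptable mask decoder: dense image features
# from a (stubbed) pretrained image encoder are compared with a (stubbed)
# text-encoder embedding to yield per-pixel mask logits. Dimensions and
# the cosine-similarity decoder are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextPromptableDecoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.img_proj = nn.Conv2d(512, dim, kernel_size=1)  # from image encoder
        self.txt_proj = nn.Linear(512, dim)                 # from text encoder

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, 512, H, W); txt_feats: (B, 512).
        v = F.normalize(self.img_proj(img_feats), dim=1)
        t = F.normalize(self.txt_proj(txt_feats), dim=-1)
        # Per-pixel cosine similarity with the prompt acts as mask logits.
        return torch.einsum("bchw,bc->bhw", v, t)


decoder = TextPromptableDecoder()
logits = decoder(torch.randn(2, 512, 32, 32), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 32, 32])
```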
arXiv Detail & Related papers (2023-06-15T16:26:20Z)