SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
- URL: http://arxiv.org/abs/2511.16618v1
- Date: Thu, 20 Nov 2025 18:18:49 GMT
- Title: SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
- Authors: Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin
- Abstract summary: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. We construct SA-SV, the largest surgical iVOS benchmark, with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets). We propose SAM2S, a foundation model enhancing SAM2 for surgical iVOS.
- Score: 15.279735515011817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$\&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
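The abstract names DiveMem only as a trainable diverse memory mechanism and gives no implementation details. As a rough, hypothetical illustration of the general idea, the sketch below keeps a memory bank that admits a new frame only if it differs enough from what is already stored and evicts the most redundant entry when full; the class name, capacity, and similarity threshold are all assumptions for illustration, not the paper's design.

```python
# Hypothetical sketch of a diversity-based memory bank for long-term
# tracking, loosely inspired by the DiveMem description in the abstract.
# Not the paper's implementation: class name, capacity, and the
# similarity threshold are illustrative assumptions.
import torch
import torch.nn.functional as F

class DiverseMemoryBank:
    def __init__(self, capacity: int = 8, sim_threshold: float = 0.9):
        self.capacity = capacity
        self.sim_threshold = sim_threshold
        self.features: list[torch.Tensor] = []  # one pooled feature per stored frame

    def maybe_add(self, frame_feature: torch.Tensor) -> bool:
        """Store a frame only if it is sufficiently different from memory."""
        f = F.normalize(frame_feature.flatten(), dim=0)
        if self.features:
            sims = torch.stack([torch.dot(f, m) for m in self.features])
            if sims.max() > self.sim_threshold:
                return False  # too similar to an existing memory frame
        if len(self.features) >= self.capacity:
            # Evict the frame most similar to the rest (least diverse).
            stacked = torch.stack(self.features)
            redundancy = (stacked @ stacked.T).sum(dim=1)
            self.features.pop(int(redundancy.argmax()))
        self.features.append(f)
        return True

bank = DiverseMemoryBank()
for t in range(100):                      # simulated video stream
    feat = torch.randn(256)               # stand-in for a frame embedding
    bank.maybe_add(feat)
print(len(bank.features))                 # at most `capacity` diverse frames
```

Favoring diverse frames over purely recent ones is one plausible way to keep long-horizon appearance changes represented in a bounded memory.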
Related papers
- Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z)
- UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity [54.41309926099154]
We introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs. We show that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
arXiv Detail & Related papers (2025-11-17T18:58:34Z)
- VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel [68.24765319399286]
We present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation. VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, and (3) a lightweight mask decoder to reduce jagged artifacts. VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU.
arXiv Detail & Related papers (2025-11-02T15:47:05Z)
- TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios [1.0596160761674702]
We propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features.
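The abstract only names these two strategies. As a hedged sketch of the first one, multi-temporal-scale clip sampling can be illustrated as below; the clip length and stride set are assumptions for illustration, not the paper's configuration.

```python
# Illustrative sketch of multi-temporal-scale clip sampling of the kind
# TSMS-SAM2's abstract describes. Strides and clip length are assumptions.
# Assumes num_frames >= clip_len.
import random

def sample_multiscale_clip(num_frames: int, clip_len: int = 8,
                           strides=(1, 2, 4)) -> list[int]:
    """Pick a random temporal stride, then a random window at that stride."""
    stride = random.choice(strides)
    span = (clip_len - 1) * stride + 1
    if span > num_frames:               # fall back to the densest stride
        stride, span = 1, clip_len
    start = random.randint(0, num_frames - span)
    return [start + i * stride for i in range(clip_len)]

print(sample_multiscale_clip(num_frames=300))  # frame indices at stride 1, 2, or 4
```

Training on clips drawn at several strides exposes the tracker to both slow and rapid apparent motion, which is one plausible way to improve robustness to motion variability.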
arXiv Detail & Related papers (2025-08-07T20:11:15Z)
- Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2 [3.2852663769413106]
We propose DD-SAM2, an efficient adaptation framework for SAM2. DD-SAM2 incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction. DD-SAM2 fully exploits SAM2's streaming memory for medical video object tracking and segmentation.
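The abstract does not specify the DD-Adapter's layout. A minimal sketch, assuming the common adapter pattern (1x1 down-projection, parallel depthwise convolutions at several dilation rates, 1x1 up-projection, residual connection), might look like this; the actual DD-SAM2 design may differ.

```python
# A minimal sketch of a depthwise-dilated adapter block, assuming the
# common adapter pattern. DD-SAM2's exact design may differ; bottleneck
# width and dilation rates are illustrative assumptions.
import torch
import torch.nn as nn

class DepthwiseDilatedAdapter(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 64, dilations=(1, 2, 4)):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        # One depthwise conv per dilation rate for multi-scale context.
        self.branches = nn.ModuleList([
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                      padding=d, dilation=d, groups=bottleneck)
            for d in dilations
        ])
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.down(x))
        h = sum(branch(h) for branch in self.branches)
        return x + self.up(self.act(h))   # residual keeps the frozen backbone intact

adapter = DepthwiseDilatedAdapter(channels=256)
y = adapter(torch.randn(1, 256, 32, 32))  # same shape in and out
```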
arXiv Detail & Related papers (2025-07-19T13:19:55Z)
- Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation [18.71772979219666]
We introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis 2017 and EndoVis 2018 datasets.
arXiv Detail & Related papers (2025-07-13T11:05:25Z)
- Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2 [12.243345510831263]
Short-Long Memory SAM 2 (SLM-SAM 2) is a novel architecture that integrates distinct short-term and long-term memory banks to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos.
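As a hedged illustration of distinct short- and long-term memory banks, the sketch below pairs a FIFO buffer of recent frames with a reservoir-sampled long-horizon set; the bank sizes and the reservoir policy are assumptions, not the paper's specification.

```python
# Hedged sketch of separate short- and long-term memory banks in the
# spirit of SLM-SAM 2's abstract. Sizes and the reservoir policy are
# illustrative assumptions.
from collections import deque
import random

class ShortLongMemory:
    def __init__(self, short_size: int = 4, long_size: int = 8):
        self.short = deque(maxlen=short_size)  # most recent frames (FIFO)
        self.long: list = []                   # sparse long-horizon anchors
        self.long_size = long_size
        self._seen = 0

    def add(self, frame_feature) -> None:
        self.short.append(frame_feature)
        # Reservoir sampling keeps a uniform sample over the whole video.
        self._seen += 1
        if len(self.long) < self.long_size:
            self.long.append(frame_feature)
        elif random.random() < self.long_size / self._seen:
            self.long[random.randrange(self.long_size)] = frame_feature

    def read(self) -> list:
        return list(self.short) + self.long  # conditioning set for the decoder

mem = ShortLongMemory()
for t in range(200):
    mem.add(f"feat_{t}")                     # stand-in for frame features
print(len(mem.read()))                       # short_size + long_size entries
```

Keeping the two banks separate lets recent frames track fast appearance changes while the long-term bank anchors identity over the full video.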
arXiv Detail & Related papers (2025-05-03T16:16:24Z)
- EdgeTAM: On-Device Track Anything Model [65.10032957471824]
Segment Anything Model (SAM) 2 further extends its capability from image to video inputs through a memory bank mechanism. We aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining comparable performance. We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
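The abstract gives no detail on the 2D Spatial Perceiver. One plausible reading is a Perceiver-style cross-attention that compresses dense memory features into a small set of learned latents, sketched below with assumed latent count and dimensions.

```python
# Rough sketch of a Perceiver-style latent cross-attention as one
# plausible reading of EdgeTAM's "2D Spatial Perceiver". Latent count,
# width, and head count are assumptions.
import torch
import torch.nn as nn

class SpatialPerceiver(nn.Module):
    def __init__(self, dim: int = 256, num_latents: int = 64, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (B, H*W, dim) dense features; output: (B, num_latents, dim)
        q = self.latents.unsqueeze(0).expand(memory.size(0), -1, -1)
        out, _ = self.attn(q, memory, memory)
        return self.norm(out + q)

perceiver = SpatialPerceiver()
compressed = perceiver(torch.randn(2, 64 * 64, 256))  # 4096 tokens -> 64 latents
```

Downstream attention over 64 latents instead of thousands of spatial tokens is what would cut the cost on mobile hardware.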
arXiv Detail & Related papers (2025-01-13T12:11:07Z)
- Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning [13.90996725220123]
We introduce SurgSAM2, an advanced model that augments SAM2 with an Efficient Frame Pruning (EFP) mechanism to facilitate real-time surgical video segmentation. SurgSAM2 significantly improves both efficiency and segmentation accuracy compared to the vanilla SAM2. Our experiments demonstrate that SurgSAM2 achieves 3$\times$ the FPS of SAM2, while also delivering state-of-the-art performance after fine-tuning with lower-resolution data.
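The EFP algorithm itself is not described in the abstract. A hypothetical sketch of similarity-based memory-frame pruning, one plausible way to realize frame pruning, follows; the redundancy score is an assumption, not the published EFP rule.

```python
# Illustrative sketch of similarity-based memory-frame pruning in the
# spirit of SurgSAM2's Efficient Frame Pruning; the scoring rule is an
# assumption, not the published EFP algorithm.
import torch
import torch.nn.functional as F

def prune_memory(features: torch.Tensor, keep: int) -> torch.Tensor:
    """Drop the most redundant frames until only `keep` remain.

    features: (N, D) pooled embeddings of the N frames in the memory bank.
    Returns the indices of the frames to keep.
    """
    idx = list(range(features.size(0)))
    feats = F.normalize(features, dim=1)
    while len(idx) > keep:
        sub = feats[idx]
        sim = sub @ sub.T                       # pairwise cosine similarity
        sim.fill_diagonal_(0.0)
        drop = int(sim.sum(dim=1).argmax())     # most redundant frame
        idx.pop(drop)
    return torch.tensor(idx)

bank = torch.randn(16, 256)                     # 16 memory frames
print(prune_memory(bank, keep=8))               # indices of 8 diverse frames
```

A smaller memory bank means fewer tokens for the memory attention to read at every frame, which is where the FPS gain would come from.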
arXiv Detail & Related papers (2024-08-15T04:59:12Z)
- Zero-Shot Surgical Tool Segmentation in Monocular Video Using Segment Anything Model 2 [4.418542191434178]
The Segment Anything Model 2 (SAM 2) is the latest generation foundation model for image and video segmentation.
We evaluate the zero-shot video segmentation performance of the SAM 2 model across different types of surgeries, including endoscopy and microscopy.
We found that: 1) SAM 2 demonstrates a strong capability for segmenting various surgical videos; 2) When new tools enter the scene, additional prompts are necessary to maintain segmentation accuracy; and 3) Specific challenges inherent to surgical videos can impact the robustness of SAM 2.
arXiv Detail & Related papers (2024-08-03T03:19:56Z)
- TinySAM: Pushing the Envelope for Efficient Segment Anything Model [73.06322749886483]
We propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining strong zero-shot performance. With these proposed methods, our TinySAM achieves an orders-of-magnitude computational reduction and pushes the envelope for the efficient segment anything task.
arXiv Detail & Related papers (2023-12-21T12:26:11Z)
- SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation [65.52097667738884]
We introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to integrate surgical-specific information with SAM's pre-trained knowledge for improved generalisation.
Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes.
In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning.
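Neither component is specified beyond these sentences. The sketch below shows one plausible form of a prototype-based class prompt encoder together with a contrastive prototype loss; the shapes, class count, projection head, and temperature are chosen for illustration only.

```python
# Hedged sketch of a prototype-based class prompt encoder plus a
# contrastive prototype loss, following the ideas named in the abstract.
# Shapes, class count, projection head, and temperature are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassPromptEncoder(nn.Module):
    def __init__(self, num_classes: int = 7, dim: int = 256):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)  # maps a prototype to a prompt embedding

    def forward(self, class_id: torch.Tensor) -> torch.Tensor:
        return self.proj(self.prototypes[class_id])  # (B, dim) prompt embeddings

def contrastive_prototype_loss(feats, labels, prototypes, tau: float = 0.07):
    """Pull instance features toward their class prototype, push from others."""
    logits = F.normalize(feats, dim=1) @ F.normalize(prototypes, dim=1).T / tau
    return F.cross_entropy(logits, labels)

enc = ClassPromptEncoder()
prompts = enc(torch.tensor([0, 3]))                 # prompts for classes 0 and 3
loss = contrastive_prototype_loss(torch.randn(4, 256),
                                  torch.tensor([0, 1, 2, 3]),
                                  enc.prototypes)
```

Sharpening the separation between prototypes in embedding space is one natural way to address the low inter-class variance among instrument categories that the abstract mentions.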
arXiv Detail & Related papers (2023-08-17T02:51:01Z)