SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
- URL: http://arxiv.org/abs/2511.16618v1
- Date: Thu, 20 Nov 2025 18:18:49 GMT
- Title: SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
- Authors: Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin
- Abstract summary: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. We construct SA-SV, the largest surgical iVOS benchmark, with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets). We propose SAM2S, a foundation model enhancing SAM2 for surgical iVOS.
- Score: 15.279735515011817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$\&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
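The abstract names DiveMem only as a trainable diverse memory mechanism and gives no implementation details. As a rough, hypothetical illustration of the general idea, the sketch below keeps a memory bank that admits a new frame only if it differs enough from what is already stored and evicts the most redundant entry when full; the class name, capacity, and similarity threshold are all assumptions for illustration, not the paper's design.

```python
# Hypothetical sketch of a diversity-based memory bank for long-term
# tracking, loosely inspired by the DiveMem description in the abstract.
# Not the paper's implementation: class name, capacity, and the
# similarity threshold are illustrative assumptions.
import torch
import torch.nn.functional as F

class DiverseMemoryBank:
    def __init__(self, capacity: int = 8, sim_threshold: float = 0.9):
        self.capacity = capacity
        self.sim_threshold = sim_threshold
        self.features: list[torch.Tensor] = []  # one pooled feature per stored frame

    def maybe_add(self, frame_feature: torch.Tensor) -> bool:
        """Store a frame only if it is sufficiently different from memory."""
        f = F.normalize(frame_feature.flatten(), dim=0)
        if self.features:
            sims = torch.stack([torch.dot(f, m) for m in self.features])
            if sims.max() > self.sim_threshold:
                return False  # too similar to an existing memory frame
        if len(self.features) >= self.capacity:
            # Evict the frame most similar to the rest (least diverse).
            stacked = torch.stack(self.features)
            redundancy = (stacked @ stacked.T).sum(dim=1)
            self.features.pop(int(redundancy.argmax()))
        self.features.append(f)
        return True

bank = DiverseMemoryBank()
for t in range(100):                      # simulated video stream
    feat = torch.randn(256)               # stand-in for a frame embedding
    bank.maybe_add(feat)
print(len(bank.features))                 # at most `capacity` diverse frames
```

Favoring diverse frames over purely recent ones is one plausible way to keep long-horizon appearance changes represented in a bounded memory.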
Related papers
- Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z)
- UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity [54.41309926099154]
We introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs. We show that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
arXiv Detail & Related papers (2025-11-17T18:58:34Z)
- VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel [68.24765319399286]
We present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation. VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, and (3) a lightweight mask decoder to reduce jagged artifacts. VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU.
arXiv Detail & Related papers (2025-11-02T15:47:05Z)
- TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios [1.0596160761674702]
We propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features.
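The abstract only names these two strategies. As a hedged sketch of the first one, multi-temporal-scale clip sampling can be illustrated as below; the clip length and stride set are assumptions for illustration, not the paper's configuration.

```python
# Illustrative sketch of multi-temporal-scale clip sampling of the kind
# TSMS-SAM2's abstract describes. Strides and clip length are assumptions.
# Assumes num_frames >= clip_len.
import random

def sample_multiscale_clip(num_frames: int, clip_len: int = 8,
                           strides=(1, 2, 4)) -> list[int]:
    """Pick a random temporal stride, then a random window at that stride."""
    stride = random.choice(strides)
    span = (clip_len - 1) * stride + 1
    if span > num_frames:               # fall back to the densest stride
        stride, span = 1, clip_len
    start = random.randint(0, num_frames - span)
    return [start + i * stride for i in range(clip_len)]

print(sample_multiscale_clip(num_frames=300))  # frame indices at stride 1, 2, or 4
```

Training on clips drawn at several strides exposes the tracker to both slow and rapid apparent motion, which is one plausible way to improve robustness to motion variability.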
arXiv Detail & Related papers (2025-08-07T20:11:15Z)
- Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2 [3.2852663769413106]
We propose DD-SAM2, an efficient adaptation framework for SAM2. DD-SAM2 incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction. DD-SAM2 fully exploits SAM2's streaming memory for medical video object tracking and segmentation.
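The abstract does not specify the DD-Adapter's layout. A minimal sketch, assuming the common adapter pattern (1x1 down-projection, parallel depthwise convolutions at several dilation rates, 1x1 up-projection, residual connection), might look like this; the actual DD-SAM2 design may differ.

```python
# A minimal sketch of a depthwise-dilated adapter block, assuming the
# common adapter pattern. DD-SAM2's exact design may differ; bottleneck
# width and dilation rates are illustrative assumptions.
import torch
import torch.nn as nn

class DepthwiseDilatedAdapter(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 64, dilations=(1, 2, 4)):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        # One depthwise conv per dilation rate for multi-scale context.
        self.branches = nn.ModuleList([
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                      padding=d, dilation=d, groups=bottleneck)
            for d in dilations
        ])
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.down(x))
        h = sum(branch(h) for branch in self.branches)
        return x + self.up(self.act(h))   # residual keeps the frozen backbone intact

adapter = DepthwiseDilatedAdapter(channels=256)
y = adapter(torch.randn(1, 256, 32, 32))  # same shape in and out
```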
arXiv Detail & Related papers (2025-07-19T13:19:55Z)
- Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation [18.71772979219666]
We introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis 2017 and EndoVis 2018 datasets.
arXiv Detail & Related papers (2025-07-13T11:05:25Z)
- Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2 [12.243345510831263]
Short-Long Memory SAM 2 (SLM-SAM 2) is a novel architecture that integrates distinct short-term and long-term memory banks to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos.
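As a hedged illustration of distinct short- and long-term memory banks, the sketch below pairs a FIFO buffer of recent frames with a reservoir-sampled long-horizon set; the bank sizes and the reservoir policy are assumptions, not the paper's specification.

```python
# Hedged sketch of separate short- and long-term memory banks in the
# spirit of SLM-SAM 2's abstract. Sizes and the reservoir policy are
# illustrative assumptions.
from collections import deque
import random

class ShortLongMemory:
    def __init__(self, short_size: int = 4, long_size: int = 8):
        self.short = deque(maxlen=short_size)  # most recent frames (FIFO)
        self.long: list = []                   # sparse long-horizon anchors
        self.long_size = long_size
        self._seen = 0

    def add(self, frame_feature) -> None:
        self.short.append(frame_feature)
        # Reservoir sampling keeps a uniform sample over the whole video.
        self._seen += 1
        if len(self.long) < self.long_size:
            self.long.append(frame_feature)
        elif random.random() < self.long_size / self._seen:
            self.long[random.randrange(self.long_size)] = frame_feature

    def read(self) -> list:
        return list(self.short) + self.long  # conditioning set for the decoder

mem = ShortLongMemory()
for t in range(200):
    mem.add(f"feat_{t}")                     # stand-in for frame features
print(len(mem.read()))                       # short_size + long_size entries
```

Keeping the two banks separate lets recent frames track fast appearance changes while the long-term bank anchors identity over the full video.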
arXiv Detail & Related papers (2025-05-03T16:16:24Z)
- EdgeTAM: On-Device Track Anything Model [65.10032957471824]
Segment Anything Model (SAM) 2 further extends its capability from image to video inputs through a memory bank mechanism. We aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining comparable performance. We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
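The abstract gives no detail on the 2D Spatial Perceiver. One plausible reading is a Perceiver-style cross-attention that compresses dense memory features into a small set of learned latents, sketched below with assumed latent count and dimensions.

```python
# Rough sketch of a Perceiver-style latent cross-attention as one
# plausible reading of EdgeTAM's "2D Spatial Perceiver". Latent count,
# width, and head count are assumptions.
import torch
import torch.nn as nn

class SpatialPerceiver(nn.Module):
    def __init__(self, dim: int = 256, num_latents: int = 64, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (B, H*W, dim) dense features; output: (B, num_latents, dim)
        q = self.latents.unsqueeze(0).expand(memory.size(0), -1, -1)
        out, _ = self.attn(q, memory, memory)
        return self.norm(out + q)

perceiver = SpatialPerceiver()
compressed = perceiver(torch.randn(2, 64 * 64, 256))  # 4096 tokens -> 64 latents
```

Downstream attention over 64 latents instead of thousands of spatial tokens is what would cut the cost on mobile hardware.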
arXiv Detail & Related papers (2025-01-13T12:11:07Z)
- Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning [13.90996725220123]
We introduce SurgSAM2, an advanced model that augments SAM2 with an Efficient Frame Pruning (EFP) mechanism to facilitate real-time surgical video segmentation. SurgSAM2 significantly improves both efficiency and segmentation accuracy compared to the vanilla SAM2. Our experiments demonstrate that SurgSAM2 achieves 3$\times$ the FPS of SAM2, while also delivering state-of-the-art performance after fine-tuning with lower-resolution data.
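The EFP algorithm itself is not described in the abstract. A hypothetical sketch of similarity-based memory-frame pruning, one plausible way to realize frame pruning, follows; the redundancy score is an assumption, not the published EFP rule.

```python
# Illustrative sketch of similarity-based memory-frame pruning in the
# spirit of SurgSAM2's Efficient Frame Pruning; the scoring rule is an
# assumption, not the published EFP algorithm.
import torch
import torch.nn.functional as F

def prune_memory(features: torch.Tensor, keep: int) -> torch.Tensor:
    """Drop the most redundant frames until only `keep` remain.

    features: (N, D) pooled embeddings of the N frames in the memory bank.
    Returns the indices of the frames to keep.
    """
    idx = list(range(features.size(0)))
    feats = F.normalize(features, dim=1)
    while len(idx) > keep:
        sub = feats[idx]
        sim = sub @ sub.T                       # pairwise cosine similarity
        sim.fill_diagonal_(0.0)
        drop = int(sim.sum(dim=1).argmax())     # most redundant frame
        idx.pop(drop)
    return torch.tensor(idx)

bank = torch.randn(16, 256)                     # 16 memory frames
print(prune_memory(bank, keep=8))               # indices of 8 diverse frames
```

A smaller memory bank means fewer tokens for the memory attention to read at every frame, which is where the FPS gain would come from.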
arXiv Detail & Related papers (2024-08-15T04:59:12Z)
- Zero-Shot Surgical Tool Segmentation in Monocular Video Using Segment Anything Model 2 [4.418542191434178]
The Segment Anything Model 2 (SAM 2) is the latest generation foundation model for image and video segmentation.
We evaluate the zero-shot video segmentation performance of the SAM 2 model across different types of surgeries, including endoscopy and microscopy.
We found that: 1) SAM 2 demonstrates a strong capability for segmenting various surgical videos; 2) When new tools enter the scene, additional prompts are necessary to maintain segmentation accuracy; and 3) Specific challenges inherent to surgical videos can impact the robustness of SAM 2.
arXiv Detail & Related papers (2024-08-03T03:19:56Z)
- TinySAM: Pushing the Envelope for Efficient Segment Anything Model [73.06322749886483]
We propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining strong zero-shot performance. With these proposed methods, our TinySAM achieves an orders-of-magnitude computational reduction and pushes the envelope for the efficient segment anything task.
arXiv Detail & Related papers (2023-12-21T12:26:11Z)
- SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation [65.52097667738884]
We introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to integrate surgical-specific information with SAM's pre-trained knowledge for improved generalisation.
Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes.
In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning.
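Neither component is specified beyond these sentences. The sketch below shows one plausible form of a prototype-based class prompt encoder together with a contrastive prototype loss; the shapes, class count, projection head, and temperature are chosen for illustration only.

```python
# Hedged sketch of a prototype-based class prompt encoder plus a
# contrastive prototype loss, following the ideas named in the abstract.
# Shapes, class count, projection head, and temperature are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassPromptEncoder(nn.Module):
    def __init__(self, num_classes: int = 7, dim: int = 256):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)  # maps a prototype to a prompt embedding

    def forward(self, class_id: torch.Tensor) -> torch.Tensor:
        return self.proj(self.prototypes[class_id])  # (B, dim) prompt embeddings

def contrastive_prototype_loss(feats, labels, prototypes, tau: float = 0.07):
    """Pull instance features toward their class prototype, push from others."""
    logits = F.normalize(feats, dim=1) @ F.normalize(prototypes, dim=1).T / tau
    return F.cross_entropy(logits, labels)

enc = ClassPromptEncoder()
prompts = enc(torch.tensor([0, 3]))                 # prompts for classes 0 and 3
loss = contrastive_prototype_loss(torch.randn(4, 256),
                                  torch.tensor([0, 1, 2, 3]),
                                  enc.prototypes)
```

Sharpening the separation between prototypes in embedding space is one natural way to address the low inter-class variance among instrument categories that the abstract mentions.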
arXiv Detail & Related papers (2023-08-17T02:51:01Z)