Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation
- URL: http://arxiv.org/abs/2507.09577v1
- Date: Sun, 13 Jul 2025 11:05:25 GMT
- Title: Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation
- Authors: Ming Yin, Fu Wang, Xujiong Ye, Yanda Meng, Zeyu Fu
- Abstract summary: We introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis 2017 and EndoVis 2018 datasets.
- Score: 18.71772979219666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical video segmentation is a critical task in computer-assisted surgery, essential for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has demonstrated remarkable advancements in both image and video segmentation. However, the inherent limitations of SAM2's greedy-selection memory design are amplified by the unique properties of surgical videos (rapid instrument movement, frequent occlusion, and complex instrument-tissue interaction), resulting in diminished performance on complex, long videos. To address these challenges, we introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy featuring novel context-aware and occlusion-resilient memory models. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements while maintaining accuracy in segmenting objects throughout videos. Employing multi-target, single-loop, one-prompt inference further enhances the efficiency of tracking in multi-instrument videos. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis 2017 and EndoVis 2018 datasets, respectively, demonstrating its potential for practical surgical applications.
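The abstract's two key mechanisms, context-aware memory selection and occlusion resilience, can be made concrete with a minimal sketch. Everything below (the `MemoryEntry` fields, the similarity ranking, and the visibility test) is an illustrative guess at how such a memory bank could work, not the authors' implementation:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class MemoryEntry:
    frame_idx: int
    features: np.ndarray  # frame embedding from the memory encoder
    mask_area: float      # fraction of pixels assigned to the object

def select_memory(entries, current_feat, k=7, occlusion_thresh=0.01):
    """Choose k memory frames to condition the next prediction on.

    SAM2's default is greedy: take the k most recent frames. This sketch
    instead (a) drops frames where the object was occluded and (b) ranks
    the rest by similarity to the current frame's embedding.
    """
    # Occlusion-resilient: skip frames where the object all but vanished.
    visible = [e for e in entries if e.mask_area > occlusion_thresh]

    def similarity(e):
        a, b = e.features.ravel(), current_feat.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Context-aware: keep the k most relevant frames, not the k most recent.
    return sorted(visible, key=similarity, reverse=True)[:k]
```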
Related papers
- Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2 [1.0596160761674702]
We propose DD-SAM2, an efficient adaptation framework for SAM2. DD-SAM2 incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction. DD-SAM2 fully exploits SAM2's streaming memory for medical video object tracking and segmentation.
arXiv Detail & Related papers (2025-07-19T13:19:55Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework performing multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2 [10.279314732888079]
Short-Long Memory SAM 2 (SLM-SAM 2) is a novel architecture that integrates distinct short-term and long-term memory banks to improve segmentation accuracy. We evaluate SLM-SAM 2 on three public datasets covering organs, bones, and muscles across MRI and CT modalities.
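A dual short-term/long-term bank of this general shape is easy to sketch; the bank sizes and the fixed sampling stride below are illustrative assumptions, not the SLM-SAM 2 configuration:

```python
from collections import deque

class ShortLongMemory:
    """Toy dual memory: dense recent context plus sparse long-range anchors."""

    def __init__(self, short_size=4, long_size=8, long_stride=30):
        self.short = deque(maxlen=short_size)  # last few frames, fine detail
        self.long = []                         # sparse anchors over the video
        self.long_stride = long_stride
        self.long_size = long_size

    def add(self, frame_idx, features):
        self.short.append((frame_idx, features))
        if frame_idx % self.long_stride == 0:
            self.long.append((frame_idx, features))
            self.long = self.long[-self.long_size:]  # cap the anchor count

    def context(self):
        # Condition the predictor on both banks, de-duplicated by frame index.
        merged = {idx: f for idx, f in self.long + list(self.short)}
        return sorted(merged.items())
```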
arXiv Detail & Related papers (2025-05-03T16:16:24Z)
- DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency [91.30252180093333]
We propose the Dual Consistency SAM (DC-SAM) method, based on prompt tuning, to adapt SAM and SAM2 for in-context segmentation. Our key insight is to enhance the features of SAM's prompt encoder in segmentation by providing high-quality visual prompts. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2.
arXiv Detail & Related papers (2025-04-16T13:41:59Z)
- SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation [51.90445260276897]
We show that the Segment Anything Model 2 (SAM2) can be a strong encoder for U-shaped segmentation models.
We propose a simple but effective framework, termed SAM2-UNet, for versatile image segmentation.
arXiv Detail & Related papers (2024-08-16T17:55:38Z)
- Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning [13.90996725220123]
We introduce SurgSAM2, an advanced model that couples SAM2 with an Efficient Frame Pruning (EFP) mechanism to facilitate real-time surgical video segmentation. SurgSAM2 significantly improves both efficiency and segmentation accuracy compared to the vanilla SAM2. Our experiments demonstrate that SurgSAM2 achieves 3x the FPS of SAM2, while also delivering state-of-the-art performance after fine-tuning with lower-resolution data.
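The pruning idea can be sketched in a few lines; the usefulness score is a placeholder for whatever criterion EFP actually uses, so this is a sketch of the general technique rather than the paper's method:

```python
def prune_memory_bank(bank, max_size, score_fn):
    """Frame-pruning sketch: keep the memory bank at a fixed budget by
    evicting the least useful frame, so per-frame inference cost stays
    roughly constant over long surgical videos.

    `score_fn(entry)` is a placeholder for the paper's actual selection
    criterion (e.g. similarity to the current frame or mask confidence).
    """
    while len(bank) > max_size:
        worst = min(bank, key=score_fn)  # lowest-scoring frame
        bank.remove(worst)
    return bank
```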
arXiv Detail & Related papers (2024-08-15T04:59:12Z)
- SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation [13.609341065893739]
This study explores the zero-shot segmentation performance of SAM 2 in robot-assisted surgery based on prompts.
We employ two forms of prompts, 1-point and bounding box; for video sequences, the 1-point prompt is applied to the initial frame.
The results with point prompts also exhibit a substantial enhancement over SAM's capabilities, nearing or even surpassing existing unprompted SOTA methods.
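The prompting protocol evaluated here (a single positive click on the first frame, then propagation through the video) maps directly onto the SAM 2 video predictor from facebookresearch/sam2. A minimal sketch; the config path, checkpoint path, frame directory, and click coordinates are placeholders to adapt to your setup:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint paths are assumptions; substitute your local copies.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml",
                                       "checkpoints/sam2.1_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="surgery_frames/")  # dir of JPEG frames
    # 1-point prompt on the initial frame: one positive click on the instrument.
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[480, 300]], dtype=np.float32),  # (x, y), illustrative
        labels=np.array([1], dtype=np.int32),             # 1 = positive click
    )
    # Propagate the mask through the rest of the video.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```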
arXiv Detail & Related papers (2024-08-08T17:08:57Z)
- Zero-Shot Surgical Tool Segmentation in Monocular Video Using Segment Anything Model 2 [4.418542191434178]
The Segment Anything Model 2 (SAM 2) is the latest generation foundation model for image and video segmentation.
We evaluate the zero-shot video segmentation performance of the SAM 2 model across different types of surgeries, including endoscopy and microscopy.
We found that: 1) SAM 2 demonstrates a strong capability for segmenting various surgical videos; 2) When new tools enter the scene, additional prompts are necessary to maintain segmentation accuracy; and 3) Specific challenges inherent to surgical videos can impact the robustness of SAM 2.
arXiv Detail & Related papers (2024-08-03T03:19:56Z)
- FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM, whose pipeline is redesigned in two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z)
- SurgicalPart-SAM: Part-to-Whole Collaborative Prompting for Surgical Instrument Segmentation [66.21356751558011]
The Segment Anything Model (SAM) exhibits promise in generic object segmentation and offers potential for various applications.
Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data.
We propose SurgicalPart-SAM (SP-SAM), a novel SAM efficient-tuning approach that explicitly integrates instrument structure knowledge with SAM's generic knowledge.
arXiv Detail & Related papers (2023-12-22T07:17:51Z)
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
- Learning Motion Flows for Semi-supervised Instrument Segmentation from Robotic Surgical Video [64.44583693846751]
We study the semi-supervised instrument segmentation from robotic surgical videos with sparse annotations.
By exploiting generated data pairs, our framework can recover and even enhance temporal consistency of training sequences.
Results show that our method outperforms the state-of-the-art semi-supervised methods by a large margin.
arXiv Detail & Related papers (2020-07-06T02:39:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.