Related papers: More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery

More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery

URL: http://arxiv.org/abs/2512.07596v2
Date: Wed, 10 Dec 2025 07:08:21 GMT
Title: More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery
Authors: Wenzhen Dong, Jieming Yu, Yiming Huang, Hongqiu Wang, Lei Zhu, Albert C. S. Chung, Hongliang Ren, Long Bai,
Abstract summary: SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts.<n> SAM 3D shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts.<n>While language prompts show potential, their performance in the surgical domain is currently suboptimal.
Score: 23.89471097213949
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent SAM 3 and SAM 3D have introduced significant advancements over the predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3D's depth reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while the zero-shot evaluations of SAM 3D on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.

Related papers

Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data [0.3867363075280543]
Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical imaging.<n>We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting.<n>We benchmark both models on 16 public datasets covering 54 anatomical structures, pathologies, and surgical instruments.
arXiv Detail & Related papers (2025-11-26T21:36:58Z)
DEAP-3DSAM: Decoder Enhanced and Auto Prompt SAM for 3D Medical Image Segmentation [8.682548299881928]
The Segment Anything Model (SAM) has recently demonstrated significant potential in medical image segmentation.<n>We propose a Feature Enhanced Decoder that fuses the original image features with rich and detailed spatial information to enhance spatial features.<n>We also design a Dual Attention Prompter to automatically obtain prompt information through Spatial Attention and Channel Attention.
arXiv Detail & Related papers (2025-11-24T13:07:22Z)
VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel [68.24765319399286]
We present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation.<n>VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, and (3) a lightweight mask decoder to reduce jagged artifacts.<n>VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU.
arXiv Detail & Related papers (2025-11-02T15:47:05Z)
SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks [50.97089872043121]
We propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet.<n>We extend the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder.<n>Our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs.
arXiv Detail & Related papers (2025-08-05T15:36:13Z)
Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes [97.8612925017964]
Large-scale foundation models trained on billions of image--mask pairs cover a vast diversity of scenes, objects, and contexts.<n>SAM and its upgraded version, SAM2, have significantly influenced multiple fields within computer vision.<n>We conduct a thorough evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes.
arXiv Detail & Related papers (2024-12-02T08:03:56Z)
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners [87.76470518069338]
We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for promptable 3D segmentation. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, sparse outdoor environments, and raw LiDAR. To our best knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation.
arXiv Detail & Related papers (2024-08-29T17:59:45Z)
A Short Review and Evaluation of SAM2's Performance in 3D CT Image Segmentation [18.543271487108868]
We review existing benchmarks and point out that the SAM2 paper clearly outlines a zero-shot evaluation pipeline. Our findings reveal that directly applying SAM2 on 3D medical imaging in a zero-shot manner is far from satisfactory. For smaller single-connected objects like kidney and aorta, SAM2 performs reasonably well but for most organs it is still far behind state-of-the-art 3D annotation methods.
arXiv Detail & Related papers (2024-08-20T22:08:42Z)
SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation [51.90445260276897]
We prove that the Segment Anything Model 2 (SAM2) can be a strong encoder for U-shaped segmentation models. We propose a simple but effective framework, termed SAM2-UNet, for versatile image segmentation.
arXiv Detail & Related papers (2024-08-16T17:55:38Z)
SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation [13.609341065893739]
This study explores the zero-shot segmentation performance of SAM 2 in robot-assisted surgery based on prompts. We employ two forms of prompts: 1-point and bounding box, while for video sequences, the 1-point prompt is applied to the initial frame. The results with point prompts also exhibit a substantial enhancement over SAM's capabilities, nearing or even surpassing existing unprompted SOTA methods.
arXiv Detail & Related papers (2024-08-08T17:08:57Z)
Interactive 3D Medical Image Segmentation with SAM 2 [17.523874868612577]
We explore the zero-shot capabilities of SAM 2, the next-generation Meta SAM model trained on videos, for 3D medical image segmentation.<n>By treating sequential 2D slices of 3D images as video frames, SAM 2 can fully automatically propagate annotations from a single frame to the entire 3D volume.
arXiv Detail & Related papers (2024-08-05T16:58:56Z)
MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation [58.53672866662472]
We introduce a modality-agnostic SAM adaptation framework, named as MA-SAM. Our method roots in the parameter-efficient fine-tuning strategy to update only a small portion of weight increments. By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from input data.
arXiv Detail & Related papers (2023-09-16T02:41:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.