ENSAM: an efficient foundation model for interactive segmentation of 3D medical images
- URL: http://arxiv.org/abs/2509.15874v1
- Date: Fri, 19 Sep 2025 11:20:22 GMT
- Title: ENSAM: an efficient foundation model for interactive segmentation of 3D medical images
- Authors: Elias Stenhede, Agnar Martin Bjørnstad, Arian Ranjbar
- Abstract summary: ENSAM is a promptable model for universal 3D medical image segmentation. ENSAM is designed to achieve good performance under limited data and computational budgets. ENSAM was evaluated on a hidden test set with multimodal 3D medical images.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present ENSAM (Equivariant, Normalized, Segment Anything Model), a lightweight and promptable model for universal 3D medical image segmentation. ENSAM combines a SegResNet-based encoder with a prompt encoder and mask decoder in a U-Net-style architecture, using latent cross-attention, relative positional encoding, normalized attention, and the Muon optimizer for training. ENSAM is designed to achieve good performance under limited data and computational budgets, and is trained from scratch on under 5,000 volumes from multiple modalities (CT, MRI, PET, ultrasound, microscopy) on a single 32 GB GPU in 6 hours. As part of the CVPR 2025 Foundation Models for Interactive 3D Biomedical Image Segmentation Challenge, ENSAM was evaluated on a hidden test set of multimodal 3D medical images, obtaining a DSC AUC of 2.404, an NSD AUC of 2.266, a final DSC of 0.627, and a final NSD of 0.597, outperforming two previously published baseline models (VISTA3D, SAM-Med3D) and matching the third (SegVol), surpassing it in final DSC but trailing in the other three metrics. In the coreset track of the challenge, ENSAM ranks 5th of 10 overall and best among the approaches not using pretrained weights. Ablation studies confirm that relative positional encodings and the Muon optimizer each substantially speed up convergence and improve segmentation quality.
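The abstract's "normalized attention" with "relative positional encoding" can be illustrated with a small sketch. The formulation below is an assumption, not ENSAM's published one: it reads normalized attention as cosine-similarity attention (L2-normalized queries and keys) and adds a learned bias indexed by the relative offset between positions. `norm_attention` and `rel_bias` are hypothetical names for illustration only.

```python
import math

def norm_attention(q, k, v, rel_bias):
    """Cosine-similarity ("normalized") attention with a relative
    positional bias added to the logits. Pure-Python sketch for a 1D
    sequence; q, k, v are lists of vectors, rel_bias maps the offset
    j - i to a learned scalar bias."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    def norm(a):
        return math.sqrt(dot(a, a)) or 1.0  # avoid division by zero
    out = []
    for i, qi in enumerate(q):
        # logit = cosine similarity plus a bias depending only on the offset
        logits = [dot(qi, kj) / (norm(qi) * norm(kj)) + rel_bias[j - i]
                  for j, kj in enumerate(k)]
        m = max(logits)                      # softmax, numerically stable
        w = [math.exp(l - m) for l in logits]
        s = sum(w)
        w = [x / s for x in w]
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v))
                    for d in range(len(v[0]))])
    return out
```

With identical keys and zero bias the weights are uniform, so the output is simply the mean of the values; a nonzero `rel_bias` shifts attention toward preferred offsets independently of content.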
Related papers
- A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation [0.4358626952482685]
Mamba-SAM is a novel and efficient hybrid architecture that combines a frozen SAM encoder with the linear-time efficiency and long-range modeling capabilities of Mamba-based State Space Models (SSMs). We introduce Multi-Frequency Gated Convolution (MFGC), which enhances feature representation by jointly analyzing spatial and frequency-domain information via 3D discrete cosine transforms and adaptive gating. The dual-branch Mamba-SAM-Base model achieves a mean Dice score of 0.906, comparable to UNet++ (0.907), while outperforming all baselines on Myocardium (0.910) and Left Ventricle.
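As a rough illustration of the frequency-domain half of MFGC, the toy below is a 1D analogue (an assumption for illustration, not the paper's 3D design): it computes a type-II DCT of a signal and applies a learned sigmoid gate per frequency bin. `dct2_1d` and `gate_weights` are hypothetical names.

```python
import math

def dct2_1d(x):
    """Type-II DCT of a 1D signal (pure-Python, O(N^2), unnormalized)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def gated_frequency_features(x, gate_weights):
    """Toy 1D analogue of multi-frequency gating: transform the signal
    to the frequency domain, then scale each frequency bin by a learned
    sigmoid gate, letting the model emphasize or suppress bands."""
    spec = dct2_1d(x)
    return [c * (1.0 / (1.0 + math.exp(-w))) for c, w in zip(spec, gate_weights)]
```

For a constant signal, all energy sits in the DC bin, so the gate on bin 0 alone determines the output; the real MFGC would apply a 3D DCT over volumetric feature maps instead.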
arXiv Detail & Related papers (2026-01-31T10:51:17Z)
- VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel [68.24765319399286]
We present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation. VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, and (3) a lightweight mask decoder to reduce jagged artifacts. VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU.
arXiv Detail & Related papers (2025-11-02T15:47:05Z)
- MedSAM2: Segment Anything in 3D Medical Images and Videos [16.709180067792538]
We present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation. The model is developed by fine-tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image-mask pairs and 76,000 frames. Furthermore, we implement a human-in-the-loop pipeline to facilitate the creation of large-scale datasets, resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames.
arXiv Detail & Related papers (2025-04-04T17:13:37Z)
- Self-Prompt SAM: Medical Image Segmentation via Automatic Prompt SAM Adaptation [14.821036063099458]
Segment Anything Model (SAM) has demonstrated impressive zero-shot performance. We propose a novel self-prompt SAM adaptation framework for medical image segmentation, named Self-Prompt-SAM. Our method achieves state-of-the-art performance and outperforms nnUNet by 2.3% on AMOS2022, 1.6% on ACDC, and 0.5% on Synapse.
arXiv Detail & Related papers (2025-02-02T02:42:24Z)
- Swin-LiteMedSAM: A Lightweight Box-Based Segment Anything Model for Large-Scale Medical Image Datasets [0.6827423171182151]
We introduce Swin-LiteMedSAM, a new variant of LiteMedSAM.
This model integrates the tiny Swin Transformer as the image encoder, incorporates multiple types of prompts, and establishes skip connections between the image encoder and the mask decoder.
In the Segment Anything in Medical Images on Laptop challenge (CVPR 2024), our approach strikes a good balance between segmentation performance and speed.
arXiv Detail & Related papers (2024-09-11T10:35:42Z)
- Improved Baselines with Synchronized Encoding for Universal Medical Image Segmentation [34.08601740109437]
We introduce SyncSAM, which employs a synchronized dual-branch encoder that integrates convolution and Transformer features in a synchronized manner to enhance medical image encoding. SyncSAM achieves state-of-the-art performance on test sets and also exhibits strong zero-shot capabilities on unseen datasets.
arXiv Detail & Related papers (2024-08-19T11:01:00Z)
- Stitching, Fine-tuning, Re-training: A SAM-enabled Framework for Semi-supervised 3D Medical Image Segmentation [40.79197318484472]
Segment Anything Model (SAM) fine-tuning has shown remarkable performance in medical image segmentation in a fully supervised manner. We propose a three-stage framework, i.e., Stitching, Fine-tuning, and Re-training (SFR). Our SFR framework is plug-and-play and easily compatible with various popular semi-supervised methods.
arXiv Detail & Related papers (2024-03-17T14:30:56Z)
- Large-Vocabulary Segmentation for Medical Images with Text Prompts [68.9193694019039]
This paper aims to build a model that can Segment Anything in 3D medical images, driven by medical terminologies as text prompts, termed SAT. We construct the first multimodal knowledge tree on human anatomy, including 6,502 anatomical terminologies. We build the largest and most comprehensive segmentation dataset for training, collecting over 22K 3D scans from 72 datasets.
arXiv Detail & Related papers (2023-12-28T18:16:00Z)
- MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation [58.53672866662472]
We introduce a modality-agnostic SAM adaptation framework, named MA-SAM.
Our method is rooted in a parameter-efficient fine-tuning strategy that updates only a small portion of weight increments.
By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from input data.
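The adapter idea underlying this kind of parameter-efficient fine-tuning can be sketched in miniature. The code below is a generic bottleneck adapter (an assumption for illustration, not MA-SAM's exact 3D adapter design): project the hidden vector down, apply a nonlinearity, project back up, and add the result residually. `adapter_forward`, `w_down`, and `w_up` are hypothetical names; only the adapter weights would be trained while the backbone stays frozen.

```python
def adapter_forward(h, w_down, w_up, scale=1.0):
    """Bottleneck adapter applied to a hidden vector h: down-project,
    ReLU, up-project, then add to h (residual connection). Pure-Python
    sketch; w_down and w_up are weight matrices given as lists of rows."""
    def matvec(w, x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    z = [max(0.0, u) for u in matvec(w_down, h)]   # down-projection + ReLU
    up = matvec(w_up, z)                           # up-projection
    return [hi + scale * ui for hi, ui in zip(h, up)]
```

Because the bottleneck dimension is much smaller than the hidden dimension, the trainable parameter count stays tiny relative to the frozen backbone, which is the point of injecting adapters into the transformer blocks.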
arXiv Detail & Related papers (2023-09-16T02:41:53Z)
- 3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation [52.699139151447945]
We propose a novel adaptation method for transferring the segment anything model (SAM) from 2D to 3D for promptable medical image segmentation.
Our model outperforms domain state-of-the-art medical image segmentation models on 3 out of 4 tasks, specifically by 8.25%, 29.87%, and 10.11% for kidney tumor, pancreas tumor, and colon cancer segmentation, respectively, and achieves similar performance for liver tumor segmentation.
arXiv Detail & Related papers (2023-06-23T12:09:52Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation [51.916106055115755]
We propose a new CNN architecture that is pose- and scale-invariant thanks to the use of a Spatial Transformer Network (STN).
Our architecture is composed of three sequential modules that are estimated together during training.
We test the proposed method on kidney and renal tumor segmentation in abdominal pediatric CT scans.
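The core STN operation is differentiable resampling of the input under a predicted transform. The toy below (an assumption for illustration: a 2D warp with nearest-neighbor sampling rather than the bilinear sampling STNs typically use, and with a fixed rather than predicted transform) shows how a 2x3 affine matrix remaps pixel coordinates; `affine_warp_nn` is a hypothetical name.

```python
def affine_warp_nn(img, theta):
    """Toy spatial-transformer step: warp a 2D image (list of lists)
    with a 2x3 affine matrix theta using nearest-neighbor sampling.
    Output pixel (i, j) is read from input location theta @ [i, j, 1];
    out-of-bounds samples are filled with 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            src_i = theta[0][0] * i + theta[0][1] * j + theta[0][2]
            src_j = theta[1][0] * i + theta[1][1] * j + theta[1][2]
            si, sj = round(src_i), round(src_j)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = img[si][sj]
    return out
```

In an STN, a small localization network predicts `theta` from the input itself, and bilinear sampling keeps the warp differentiable so the whole pipeline trains end to end; that is what lets the architecture normalize pose and scale before segmentation.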
arXiv Detail & Related papers (2021-07-06T14:50:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.