Large-Vocabulary Segmentation for Medical Images with Text Prompts
- URL: http://arxiv.org/abs/2312.17183v5
- Date: Fri, 18 Jul 2025 10:57:30 GMT
- Title: Large-Vocabulary Segmentation for Medical Images with Text Prompts
- Authors: Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Xiao Zhou, Ya Zhang, Yanfeng Wang, Weidi Xie,
- Abstract summary: This paper aims to build a model that can Segment Anything in 3D medical images, driven by medical terminologies as Text prompts, termed as SAT.<n>We construct the first multimodal knowledge tree on human anatomy, including 6502 anatomical terminologies.<n>We build the largest and most comprehensive segmentation dataset for training, collecting over 22K 3D scans from 72 datasets.
- Score: 68.9193694019039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to build a model that can Segment Anything in 3D medical images, driven by medical terminologies as Text prompts, termed as SAT. Our main contributions are three-fold: (i) We construct the first multimodal knowledge tree on human anatomy, including 6502 anatomical terminologies; Then, we build the largest and most comprehensive segmentation dataset for training, collecting over 22K 3D scans from 72 datasets, across 497 classes, with careful standardization on both image and label space; (ii) We propose to inject medical knowledge into a text encoder via contrastive learning and formulate a large-vocabulary segmentation model that can be prompted by medical terminologies in text form; (iii) We train SAT-Nano (110M parameters) and SAT-Pro (447M parameters). SAT-Pro achieves comparable performance to 72 nnU-Nets -- the strongest specialist models trained on each dataset (over 2.2B parameters combined) -- over 497 categories. Compared with the interactive approach MedSAM, SAT-Pro consistently outperforms across all 7 human body regions with +7.1% average Dice Similarity Coefficient (DSC) improvement, while showing enhanced scalability and robustness. On 2 external (cross-center) datasets, SAT-Pro achieves higher performance than all baselines (+3.7% average DSC), demonstrating superior generalization ability.
Related papers
- Medal S: Spatio-Textual Prompt Model for Medical Segmentation [19.872612663709656]
Medal S supports native-resolution spatial and textual prompts within an end-to-end trainable framework.<n>It efficiently processes multiple native-resolution masks in parallel, enhancing multi-class segmentation performance.<n>A lightweight 3D convolutional module enables precise voxel-space refinement guided by both prompt types.
arXiv Detail & Related papers (2025-11-17T05:44:19Z) - Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data [62.63749675817477]
AbdomenAtlas 2.0 is a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs.<n>It achieves notable improvements over public datasets, with a +7% gain on DSC tests and +16% on out-of-distribution tests.
arXiv Detail & Related papers (2025-10-16T16:08:09Z) - ENSAM: an efficient foundation model for interactive segmentation of 3D medical images [0.0]
ENSAM is a promptable model for universal 3D medical image segmentation.<n> ENSAM is designed to achieve good performance under limited data and computational budgets.<n> ENSAM was evaluated on hidden test set with multimodal 3D medical images.
arXiv Detail & Related papers (2025-09-19T11:20:22Z) - MedSAM2: Segment Anything in 3D Medical Images and Videos [16.709180067792538]
We present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation.
The model is developed by fine-tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image-mask pairs and 76,000 frames.
Furthermore, we implement a human-in-the-loop pipeline to facilitate the creation of large-scale datasets resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames.
arXiv Detail & Related papers (2025-04-04T17:13:37Z) - A Continual Learning-driven Model for Accurate and Generalizable Segmentation of Clinically Comprehensive and Fine-grained Whole-body Anatomies in CT [67.34586036959793]
There is no fully annotated CT dataset with all anatomies delineated for training.<n>We propose a novel continual learning-driven CT model that can segment complete anatomies.<n>Our single unified CT segmentation model, CL-Net, can highly accurately segment a clinically comprehensive set of 235 fine-grained whole-body anatomies.
arXiv Detail & Related papers (2025-03-16T23:55:02Z) - TotalSegmentator MRI: Sequence-Independent Segmentation of 59 Anatomical Structures in MR images [62.53931644063323]
In this study we extended the capabilities of TotalSegmentator to MR images.
We trained an nnU-Net segmentation algorithm on this dataset and calculated similarity coefficients (Dice) to evaluate the model's performance.
The model significantly outperformed two other publicly available segmentation models (Dice score 0.824 versus 0.762; p0.001 and 0.762 versus 0.542; p)
arXiv Detail & Related papers (2024-05-29T20:15:54Z) - Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography [50.08496922659307]
We propose a universal framework enabling a single model, termed Universal Model, to deal with multiple public datasets and adapt to new classes.
Firstly, we introduce a novel language-driven parameter generator that leverages language embeddings from large language models.
Secondly, the conventional output layers are replaced with lightweight, class-specific heads, allowing Universal Model to simultaneously segment 25 organs and six types of tumors.
arXiv Detail & Related papers (2024-05-28T16:55:15Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - SegVol: Universal and Interactive Volumetric Medical Image Segmentation [25.322437534713163]
We propose a 3D foundation segmentation model, named SegVol, supporting universal and interactive volumetric medical image segmentation.
By scaling up training data to 90K unlabeled Computed Tomography (CT) volumes and 6K labeled CT volumes, this foundation model supports the segmentation of over 200 anatomical categories.
Experiments on 22 anatomical segmentation tasks verify that SegVol outperforms the competitors in 19 tasks, with improvements up to 37.24% compared to the runner-up methods.
arXiv Detail & Related papers (2023-11-22T13:27:36Z) - SAM-Med3D: Towards General-purpose Segmentation Models for Volumetric Medical Images [35.83393121891959]
We introduce SAM-Med3D for general-purpose segmentation on volumetric medical images.
SAM-Med3D can accurately segment diverse anatomical structures and lesions across various modalities.
Our approach demonstrates that substantial medical resources can be utilized to develop a general-purpose medical AI.
arXiv Detail & Related papers (2023-10-23T17:57:36Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z) - ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of
Pneumothorax [5.168314889999992]
We propose a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs.
We trained it on the CANDID-PTX dataset consisting of 3,196 positive cases of pneumothorax.
It achieved a Dice score of 0.716$pm$0.016, which was similar to the degree of inter-reader variability.
It outperformed both vision-only models and a competing vision-language model.
arXiv Detail & Related papers (2023-03-02T22:36:19Z) - Continual Segment: Towards a Single, Unified and Accessible Continual
Segmentation Model of 143 Whole-body Organs in CT Scans [31.388497540849297]
We propose a new architectural CSS learning framework to learn a single deep segmentation model for segmenting a total of 143 whole-body organs.
We trained and validated on 3D CT scans of 2500+ patients from four datasets, our single network can segment total 143 whole-body organs with very high accuracy.
arXiv Detail & Related papers (2023-02-01T00:49:21Z) - CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection [36.08551407926805]
We propose the CLIP-Driven Universal Model, which incorporates text embedding learned from Contrastive Language-Image Pre-training to segmentation models.
The proposed model is developed from an assembly of 14 datasets, using a total of 3,410 CT scans for training and then evaluated on 6,162 external CT scans from 3 additional datasets.
arXiv Detail & Related papers (2023-01-02T18:07:44Z) - Prototypical few-shot segmentation for cross-institution male pelvic
structures with spatial registration [24.089382725904304]
This work describes a fully 3D few-shot segmentation algorithm.
The trained networks can be effectively adapted to clinically interesting structures that are absent in training.
Experiments are presented in an application of segmenting eight anatomical structures important for interventional planning.
arXiv Detail & Related papers (2022-09-12T11:34:57Z) - Universal Segmentation of 33 Anatomies [19.194539991903593]
We present an approach for learning a single model that universally segments 33 anatomical structures.
We learn such a model from a union of multiple datasets, with each dataset containing the images that are partially labeled.
We evaluate our model on multiple open-source datasets, proving that our model has a good generalization performance.
arXiv Detail & Related papers (2022-03-04T02:29:54Z) - Few-shot image segmentation for cross-institution male pelvic organs
using registration-assisted prototypical learning [13.567073992605797]
This work presents the first 3D few-shot interclass segmentation network for medical images.
It uses a labelled multi-institution dataset from prostate cancer patients with eight regions of interest.
A built-in registration mechanism can effectively utilise the prior knowledge of consistent anatomy between subjects.
arXiv Detail & Related papers (2022-01-17T11:44:10Z) - Learning Contextualized Document Representations for Healthcare Answer
Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z) - VerSe: A Vertebrae Labelling and Segmentation Benchmark for
Multi-detector CT Images [121.31355003451152]
Large Scale Vertebrae Challenge (VerSe) was organised in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) in 2019 and 2020.
We present the the results of this evaluation and further investigate the performance-variation at vertebra-level, scan-level, and at different fields-of-view.
arXiv Detail & Related papers (2020-01-24T21:09:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.