Related papers: CLAPS: A CLIP-Unified Auto-Prompt Segmentation for Multi-Modal Retinal Imaging

CLAPS: A CLIP-Unified Auto-Prompt Segmentation for Multi-Modal Retinal Imaging

URL: http://arxiv.org/abs/2509.08618v1
Date: Wed, 10 Sep 2025 14:14:49 GMT
Title: CLAPS: A CLIP-Unified Auto-Prompt Segmentation for Multi-Modal Retinal Imaging
Authors: Zhihao Zhao, Yinzheng Zhao, Junjie Yang, Xiangtong Yao, Quanmin Liang, Shahrooz Faghihroohi, Kai Huang, Nassir Navab, M. Ali Nasseri,
Abstract summary: We propose CLIP-unified Auto-Prompt (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging.<n>Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset.<n>To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique "modality signature" for each imaging modality.
Score: 47.04292769940597
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (\CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique "modality signature" for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.

Related papers

TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation [56.09179939570486]
We propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations.<n>TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
arXiv Detail & Related papers (2025-12-24T12:06:26Z)
Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation.<n>MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z)
CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation [32.48945636401865]
We introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2.<n>This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs.<n>Our method begins by converting visual and textual inputs into cross-modal contextualized semantics.
arXiv Detail & Related papers (2025-06-29T07:05:27Z)
Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation [0.2765106384328772]
We propose a novel self-supervised contrastive learning framework for medical image segmentation, dubbed PolyCL.<n>PolyCL learns and transfers context-aware discriminant features useful for segmentation from an innovative surrogate.<n>We show that PolyCL outperforms fully-supervised and self-supervised baselines in both low-data and cross-domain scenarios.
arXiv Detail & Related papers (2025-05-25T16:11:48Z)
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation [30.524999223901645]
We propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion.<n>We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations.<n>State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, justifies its effectiveness in low-annotation data scenarios.
arXiv Detail & Related papers (2025-03-06T17:28:48Z)
MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation [2.2585213273821716]
We introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans.<n>Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss.<n>We also investigate using zero-shot segmentation labels within a weakly supervised paradigm to enhance segmentation quality further.
arXiv Detail & Related papers (2024-09-28T23:10:37Z)
CycleSAM: Few-Shot Surgical Scene Segmentation with Cycle- and Scene-Consistent Feature Matching [2.595014480295933]
CycleSAM is an improved visual prompt learning approach that employs a data-efficient training phase and enforces a series of soft constraints.<n>We find that CycleSAM outperforms existing few-shot SAM approaches by a factor of 2-4x in both 1-shot and 5-shot settings.
arXiv Detail & Related papers (2024-07-09T12:08:07Z)
PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework. We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
UN-SAM: Universal Prompt-Free Segmentation for Generalized Nuclei Images [47.59627416801523]
In digital pathology, precise nuclei segmentation is pivotal yet challenged by the diversity of tissue types, staining protocols, and imaging conditions. We propose the Universal prompt-free SAM framework for Nuclei segmentation (UN-SAM) UN-SAM with exceptional performance surpasses state-of-the-arts in nuclei instance and semantic segmentation, especially the generalization capability in zero-shot scenarios.
arXiv Detail & Related papers (2024-02-26T15:35:18Z)
BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM [37.1263294647351]
We introduce BLO-SAM, which finetunes the Segment Anything Model (SAM) based on bi-level optimization (BLO) BLO-SAM reduces the risk of overfitting by training the model's weight parameters and the prompt embedding on two separate subsets of the training dataset. Results demonstrate BLO-SAM's superior performance over various state-of-the-art image semantic segmentation methods.
arXiv Detail & Related papers (2024-02-26T06:36:32Z)
Self-Supervised Correction Learning for Semi-Supervised Biomedical Image Segmentation [84.58210297703714]
We propose a self-supervised correction learning paradigm for semi-supervised biomedical image segmentation. We design a dual-task network, including a shared encoder and two independent decoders for segmentation and lesion region inpainting. Experiments on three medical image segmentation datasets for different tasks demonstrate the outstanding performance of our method.
arXiv Detail & Related papers (2023-01-12T08:19:46Z)
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.