MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
- URL: http://arxiv.org/abs/2602.20423v1
- Date: Mon, 23 Feb 2026 23:46:05 GMT
- Title: MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
- Authors: Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao, Hassan Rivaz
- Abstract summary: We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
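The abstract pairs two ingredients: patch-level image-text similarity and a soft (non-one-hot) contrastive objective, weighted by per-patch predictive uncertainty. The exact formulation is not given in the abstract, so the sketch below is only an illustrative reading: the function name, the log-variance-based down-weighting, and the uniform soft targets are all assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def soft_patch_contrastive_loss(patch_mu, patch_logvar, text_emb, soft_targets, tau=0.07):
    """Hypothetical soft patch-level contrastive loss with uncertainty weighting.

    patch_mu:     (P, D) patch embedding means (L2-normalised)
    patch_logvar: (P,)   per-patch log-variance; high-variance (uncertain)
                  patches contribute less to the loss (an assumption here)
    text_emb:     (T, D) text-prompt embeddings (L2-normalised)
    soft_targets: (P, T) soft label distribution over prompts (rows sum to 1)
    """
    logits = patch_mu @ text_emb.T / tau                            # (P, T) similarities
    ce = -(soft_targets * log_softmax(logits, axis=1)).sum(axis=1)  # per-patch cross-entropy
    w = np.exp(-patch_logvar)                                       # down-weight uncertain patches
    return float((w * ce).mean() + patch_logvar.mean())             # simple variance regulariser

# Toy usage with random embeddings and uniform soft labels.
P, T, D = 4, 3, 8
mu = rng.normal(size=(P, D)); mu /= np.linalg.norm(mu, axis=1, keepdims=True)
txt = rng.normal(size=(T, D)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
targets = np.full((P, T), 1.0 / T)
loss = soft_patch_contrastive_loss(mu, np.zeros(P), txt, targets)
```

Relative to a standard one-hot contrastive loss, the soft targets let a patch match several prompts to varying degrees, which is one plausible way diverse textual prompts could provide "more nuanced semantic learning" as the abstract describes.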
Related papers
- BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation [3.7276397365086233]
BiCLIP is a framework engineered to bolster robustness in medical segmentation. It features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations. It exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
arXiv Detail & Related papers (2026-02-25T18:11:47Z) - MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval [3.7054279251399507]
This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. The framework employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence. It outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification.
arXiv Detail & Related papers (2026-02-17T21:20:32Z) - Uncertainty-Aware Vision-Language Segmentation for Medical Imaging [12.545486211087791]
We introduce a novel uncertainty-aware multimodal segmentation framework for medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks.
arXiv Detail & Related papers (2026-02-16T06:27:51Z) - MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network [51.68708264694361]
Confusion factors can affect medical images, such as complex anatomical variations and imaging modality limitations. We propose a multi-causal aware modeling backdoor-intervention optimization network for medical image segmentation. Our method significantly reduces the influence of confusion factors, leading to enhanced segmentation accuracy.
arXiv Detail & Related papers (2025-05-28T01:40:10Z) - STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation [8.812162673772459]
We propose a Scale-aware Text Prompt Network (STPNet) that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning. We evaluate our vision-language approach on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG.
arXiv Detail & Related papers (2025-04-02T10:01:42Z) - FlowSDF: Flow Matching for Medical Image Segmentation Using Distance Transforms [60.195642571004804]
We introduce FlowSDF, an image-guided conditional flow matching framework, to represent an implicit distribution of segmentation masks. Our framework enables accurate sampling of segmentation masks and the computation of relevant statistical measures.
arXiv Detail & Related papers (2024-05-28T11:47:12Z) - OTCXR: Rethinking Self-supervised Alignment using Optimal Transport for Chest X-ray Analysis [6.4136876268620115]
Self-supervised learning (SSL) has emerged as a promising technique for analyzing medical modalities such as X-rays. We propose OTCXR, a novel SSL framework that leverages optimal transport (OT) to learn dense semantic invariance. We validate OTCXR's efficacy through comprehensive experiments on three publicly available chest X-ray datasets.
arXiv Detail & Related papers (2024-04-18T02:59:48Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z) - Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This study introduces the first systematic study on transferring Vision-Language Models to 2D medical images.
Although VLSMs show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Cross-level Contrastive Learning and Consistency Constraint for Semi-supervised Medical Image Segmentation [46.678279106837294]
We propose a cross-level contrastive learning scheme to enhance the representation capacity of local features in semi-supervised medical image segmentation.
With the help of the cross-level contrastive learning and consistency constraint, the unlabelled data can be effectively explored to improve segmentation performance.
arXiv Detail & Related papers (2022-02-08T15:12:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.