CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention
- URL: http://arxiv.org/abs/2503.15949v1
- Date: Thu, 20 Mar 2025 08:46:24 GMT
- Title: CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention
- Authors: Yaxiong Chen, Minghong Wei, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
- Abstract summary: We propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation. Although CLIP was not trained on medical data, we enforce its rich semantic space onto the medical domain. To mitigate confounding bias that may cause the model to learn spurious correlations, CausalCLIPSeg introduces a causal intervention module.
- Score: 30.501326915750898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring medical image segmentation targets delineating lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Despite not being trained on medical data, we enforce CLIP's rich semantic space onto the medical domain by a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
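The text-to-pixel alignment idea can be sketched in a few lines: each per-pixel visual embedding is scored against a pooled text embedding by cosine similarity, and the scaled logits are squashed into a soft mask. This is an illustrative NumPy sketch with assumed shapes and a made-up temperature value, not the authors' CLIP-based cross-modal decoder.

```python
import numpy as np

def text_to_pixel_alignment(pixel_feats, text_feat, temperature=0.07):
    """Score every pixel embedding against a text embedding.

    pixel_feats: (H, W, D) per-pixel visual embeddings
    text_feat:   (D,) pooled text embedding
    Returns an (H, W) map of per-pixel foreground probabilities.
    """
    # L2-normalize both modalities so the dot product is cosine similarity
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    logits = (p @ t) / temperature        # (H, W) scaled cosine similarities
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> soft segmentation mask

rng = np.random.default_rng(0)
mask = text_to_pixel_alignment(rng.standard_normal((4, 4, 8)),
                               rng.standard_normal(8))
```

Thresholding the soft mask (e.g. at 0.5) would then yield a binary lesion prediction.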
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - Cycle Context Verification for In-Context Medical Image Segmentation [43.416111396585165]
In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs. We propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation.
arXiv Detail & Related papers (2025-07-11T07:18:01Z) - MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network [51.68708264694361]
Medical images are affected by confounding factors such as complex anatomical variations and imaging modality limitations. We propose a multi-causal aware modeling backdoor-intervention optimization network for medical image segmentation. Our method significantly reduces the influence of confounding factors, leading to enhanced segmentation accuracy.
arXiv Detail & Related papers (2025-05-28T01:40:10Z) - DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception [21.87721909270275]
DeCLIP is a novel framework that enhances CLIP by decoupling its features into "content" and "context" components. It significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks.
arXiv Detail & Related papers (2025-05-07T13:46:34Z) - CLIP-IT: CLIP-based Pairing for Histology Images Classification [6.5280377968471]
Multimodal learning has shown promise in medical image analysis, combining complementary modalities like histology images and text. We introduce CLIP-IT, a novel framework that relies on rich unpaired text reports, eliminating the paired-data requirement. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines.
arXiv Detail & Related papers (2025-04-22T18:14:43Z) - MedFILIP: Medical Fine-grained Language-Image Pre-training [11.894318326422054]
Existing methods struggle to accurately characterize associations between images and diseases.
MedFILIP introduces medical image-specific knowledge through contrastive learning.
For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-01-18T14:08:33Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning, together with corresponding negative images from text-to-image generators, offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
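The triplet idea above, pulling an image toward its true caption while pushing it away from a synthetic hard negative, can be illustrated with a generic margin loss over embeddings. This is a sketch of the general technique, not TripletCLIP's exact objective.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def triplet_margin_loss(img_emb, pos_txt, neg_txt, margin=0.2):
    # Loss is zero once the true caption beats the hard negative by `margin`.
    return max(0.0, margin - cosine(img_emb, pos_txt) + cosine(img_emb, neg_txt))

v = np.array([1.0, 0.0])
well_separated = triplet_margin_loss(v, v, -v)  # negative is far away
confused = triplet_margin_loss(v, -v, v)        # negative is closer than positive
```

In the first call the positive already dominates, so the hinge clamps the loss to zero; in the second the negative wins and the loss is positive.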
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - Language-guided Scale-aware MedSegmentor for Lesion Segmentation in Medical Imaging [7.912408164613206]
In clinical practice, segmenting specific lesions can significantly enhance diagnostic accuracy and treatment efficiency.
We propose a novel model, Language-guided Scale-aware MedSegmentor (LSMS), which segments target lesions in medical images based on given textual expressions.
Our LSMS consistently achieves superior performance with significantly lower computational cost.
arXiv Detail & Related papers (2024-08-30T15:22:13Z) - Data Alignment for Zero-Shot Concept Generation in Dermatology AI [0.6906005491572401]
Foundation models like CLIP, which provide zero-shot capabilities, can help alleviate this challenge.
CLIP can be fine-tuned using domain specific image-caption pairs to improve classification performance.
Our goal is to use these models to generate caption text that aligns well with both the clinical lexicon and the natural human language used in CLIP's pre-training data.
arXiv Detail & Related papers (2024-04-19T17:57:29Z) - OTCXR: Rethinking Self-supervised Alignment using Optimal Transport for Chest X-ray Analysis [6.4136876268620115]
Self-supervised learning (SSL) has emerged as a promising technique for analyzing medical modalities such as X-rays. We propose OTCXR, a novel SSL framework that leverages optimal transport (OT) to learn dense semantic invariance. We validate OTCXR's efficacy through comprehensive experiments on three publicly available chest X-ray datasets.
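OTCXR's exact formulation is not spelled out in the summary; the standard building block for OT-based alignment of this kind is the Sinkhorn iteration, which computes an entropic-regularized transport plan between two histograms. An illustrative sketch:

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.5, n_iter=500):
    """Entropic-regularized optimal transport plan between histograms a and b."""
    K = np.exp(-cost / reg)               # Gibbs kernel from the ground cost
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):               # alternate the two scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # transport plan with marginals a, b

x = np.arange(3.0)
cost = (x[:, None] - x[None, :]) ** 2     # squared-distance ground cost
a = b = np.full(3, 1.0 / 3.0)             # two uniform histograms
plan = sinkhorn_plan(cost, a, b)
```

The resulting plan's row and column sums match the input histograms, and a smaller `reg` concentrates mass on the cheapest matches.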
arXiv Detail & Related papers (2024-04-18T02:59:48Z) - A Closer Look at the Explainability of Contrastive Language-Image Pre-training [16.10032166963232]
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks.
We have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks.
We propose CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features.
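Methods in this line refine the classic class activation map (CAM). The plain CAM they build on is just a class-weighted sum of the final convolutional feature maps, min-max normalized; this generic sketch shows that baseline, not the CLIP Surgery procedure itself.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """feature_maps: (C, H, W); class_weights: (C,). Returns an (H, W) map in [0, 1]."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # weighted channel sum
    cam -= cam.min()                                         # shift minimum to 0
    return cam / (cam.max() + 1e-8)                          # normalize to [0, 1]

rng = np.random.default_rng(1)
cam = class_activation_map(rng.standard_normal((16, 7, 7)),
                           rng.standard_normal(16))
```

Upsampling this map to the input resolution gives the familiar class-specific heatmap.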
arXiv Detail & Related papers (2023-04-12T07:16:55Z) - Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Cross-level Contrastive Learning and Consistency Constraint for Semi-supervised Medical Image Segmentation [46.678279106837294]
We propose a cross-level contrastive learning scheme to enhance representation capacity for local features in semi-supervised medical image segmentation.
With the help of the cross-level contrastive learning and consistency constraint, the unlabelled data can be effectively explored to improve segmentation performance.
arXiv Detail & Related papers (2022-02-08T15:12:11Z) - DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z) - Proactive Pseudo-Intervention: Causally Informed Contrastive Learning for Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI).
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
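The summary describes intervening on salient pixels; a generic masking intervention of this kind (an assumption about the mechanism, not PPI's exact procedure) can be sketched as:

```python
import numpy as np

def mask_salient_pixels(image, salience, drop_frac=0.1):
    """Zero out the top `drop_frac` most salient pixels as a pseudo-intervention."""
    flat = salience.ravel()
    k = max(1, int(drop_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]   # k-th largest salience value
    out = image.copy()
    out[salience >= thresh] = 0.0         # remove the putatively causal evidence
    return out

rng = np.random.default_rng(2)
img = rng.random((8, 8))
sal = rng.random((8, 8))
intervened = mask_salient_pixels(img, sal, drop_frac=0.25)
```

A model whose prediction collapses under such an intervention was indeed relying on the masked, causally relevant pixels, which is the behavior causally informed training encourages.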
arXiv Detail & Related papers (2020-12-06T20:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all summaries) and is not responsible for any consequences of its use.