CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention
- URL: http://arxiv.org/abs/2503.15949v1
- Date: Thu, 20 Mar 2025 08:46:24 GMT
- Title: CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention
- Authors: Yaxiong Chen, Minghong Wei, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
- Abstract summary: We propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation. Although CLIP was not trained on medical data, we enforce its rich semantic space onto the medical domain. To mitigate confounding bias that may cause the model to learn spurious correlations, CausalCLIPSeg introduces a causal intervention module.
- Score: 30.501326915750898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring medical image segmentation targets delineating lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Despite not being trained on medical data, we enforce CLIP's rich semantic space onto the medical domain by a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
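The text-to-pixel alignment idea can be sketched in a few lines: each per-pixel visual embedding is scored against a pooled text embedding by cosine similarity, and the scaled logits are squashed into a soft mask. This is an illustrative NumPy sketch with assumed shapes and a made-up temperature value, not the authors' CLIP-based cross-modal decoder.

```python
import numpy as np

def text_to_pixel_alignment(pixel_feats, text_feat, temperature=0.07):
    """Score every pixel embedding against a text embedding.

    pixel_feats: (H, W, D) per-pixel visual embeddings
    text_feat:   (D,) pooled text embedding
    Returns an (H, W) map of per-pixel foreground probabilities.
    """
    # L2-normalize both modalities so the dot product is cosine similarity
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    logits = (p @ t) / temperature        # (H, W) scaled cosine similarities
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> soft segmentation mask

rng = np.random.default_rng(0)
mask = text_to_pixel_alignment(rng.standard_normal((4, 4, 8)),
                               rng.standard_normal(8))
```

Thresholding the soft mask (e.g. at 0.5) would then yield a binary lesion prediction.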
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - Cycle Context Verification for In-Context Medical Image Segmentation [43.416111396585165]
In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs. We propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation.
arXiv Detail & Related papers (2025-07-11T07:18:01Z) - MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network [51.68708264694361]
Medical images are affected by confounding factors such as complex anatomical variations and imaging modality limitations. We propose a multi-causal aware modeling backdoor-intervention optimization network for medical image segmentation. Our method significantly reduces the influence of confounding factors, leading to enhanced segmentation accuracy.
arXiv Detail & Related papers (2025-05-28T01:40:10Z) - DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception [21.87721909270275]
DeCLIP is a novel framework that enhances CLIP by decoupling its features into "content" and "context" components. It significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks.
arXiv Detail & Related papers (2025-05-07T13:46:34Z) - CLIP-IT: CLIP-based Pairing for Histology Images Classification [6.5280377968471]
Multimodal learning has shown promise in medical image analysis, combining complementary modalities like histology images and text. We introduce CLIP-IT, a novel framework that relies on rich unpaired text reports, eliminating the paired-data requirement. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines.
arXiv Detail & Related papers (2025-04-22T18:14:43Z) - MedFILIP: Medical Fine-grained Language-Image Pre-training [11.894318326422054]
Existing methods struggle to accurately characterize associations between images and diseases.
MedFILIP introduces medical image-specific knowledge through contrastive learning.
For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-01-18T14:08:33Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning, together with corresponding negative images from text-to-image generators, offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
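The triplet idea above, pulling an image toward its true caption while pushing it away from a synthetic hard negative, can be illustrated with a generic margin loss over embeddings. This is a sketch of the general technique, not TripletCLIP's exact objective.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def triplet_margin_loss(img_emb, pos_txt, neg_txt, margin=0.2):
    # Loss is zero once the true caption beats the hard negative by `margin`.
    return max(0.0, margin - cosine(img_emb, pos_txt) + cosine(img_emb, neg_txt))

v = np.array([1.0, 0.0])
well_separated = triplet_margin_loss(v, v, -v)  # negative is far away
confused = triplet_margin_loss(v, -v, v)        # negative is closer than positive
```

In the first call the positive already dominates, so the hinge clamps the loss to zero; in the second the negative wins and the loss is positive.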
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - Language-guided Scale-aware MedSegmentor for Lesion Segmentation in Medical Imaging [7.912408164613206]
In clinical practice, segmenting specific lesions can significantly enhance diagnostic accuracy and treatment efficiency.
We propose a novel model, Language-guided Scale-aware MedSegmentor (LSMS), which segments target lesions in medical images based on given textual expressions.
Our LSMS consistently achieves superior performance with significantly lower computational cost.
arXiv Detail & Related papers (2024-08-30T15:22:13Z) - Data Alignment for Zero-Shot Concept Generation in Dermatology AI [0.6906005491572401]
Foundation models like CLIP, which provide zero-shot capabilities, can help alleviate this challenge.
CLIP can be fine-tuned using domain specific image-caption pairs to improve classification performance.
Our goal is to use these models to generate caption text that aligns well with both the clinical lexicon and the natural human language used in CLIP's pre-training data.
arXiv Detail & Related papers (2024-04-19T17:57:29Z) - OTCXR: Rethinking Self-supervised Alignment using Optimal Transport for Chest X-ray Analysis [6.4136876268620115]
Self-supervised learning (SSL) has emerged as a promising technique for analyzing medical modalities such as X-rays. We propose OTCXR, a novel SSL framework that leverages optimal transport (OT) to learn dense semantic invariance. We validate OTCXR's efficacy through comprehensive experiments on three publicly available chest X-ray datasets.
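OTCXR's exact formulation is not spelled out in the summary; the standard building block for OT-based alignment of this kind is the Sinkhorn iteration, which computes an entropic-regularized transport plan between two histograms. An illustrative sketch:

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.5, n_iter=500):
    """Entropic-regularized optimal transport plan between histograms a and b."""
    K = np.exp(-cost / reg)               # Gibbs kernel from the ground cost
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):               # alternate the two scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # transport plan with marginals a, b

x = np.arange(3.0)
cost = (x[:, None] - x[None, :]) ** 2     # squared-distance ground cost
a = b = np.full(3, 1.0 / 3.0)             # two uniform histograms
plan = sinkhorn_plan(cost, a, b)
```

The resulting plan's row and column sums match the input histograms, and a smaller `reg` concentrates mass on the cheapest matches.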
arXiv Detail & Related papers (2024-04-18T02:59:48Z) - A Closer Look at the Explainability of Contrastive Language-Image Pre-training [16.10032166963232]
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks.
We have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks.
We propose CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features.
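Methods in this line refine the classic class activation map (CAM). The plain CAM they build on is just a class-weighted sum of the final convolutional feature maps, min-max normalized; this generic sketch shows that baseline, not the CLIP Surgery procedure itself.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """feature_maps: (C, H, W); class_weights: (C,). Returns an (H, W) map in [0, 1]."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # weighted channel sum
    cam -= cam.min()                                         # shift minimum to 0
    return cam / (cam.max() + 1e-8)                          # normalize to [0, 1]

rng = np.random.default_rng(1)
cam = class_activation_map(rng.standard_normal((16, 7, 7)),
                           rng.standard_normal(16))
```

Upsampling this map to the input resolution gives the familiar class-specific heatmap.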
arXiv Detail & Related papers (2023-04-12T07:16:55Z) - Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Cross-level Contrastive Learning and Consistency Constraint for Semi-supervised Medical Image Segmentation [46.678279106837294]
We propose a cross-level contrastive learning scheme to enhance representation capacity for local features in semi-supervised medical image segmentation.
With the help of the cross-level contrastive learning and consistency constraint, the unlabelled data can be effectively explored to improve segmentation performance.
arXiv Detail & Related papers (2022-02-08T15:12:11Z) - DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z) - Proactive Pseudo-Intervention: Causally Informed Contrastive Learning for Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI).
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
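The summary describes intervening on salient pixels; a generic masking intervention of this kind (an assumption about the mechanism, not PPI's exact procedure) can be sketched as:

```python
import numpy as np

def mask_salient_pixels(image, salience, drop_frac=0.1):
    """Zero out the top `drop_frac` most salient pixels as a pseudo-intervention."""
    flat = salience.ravel()
    k = max(1, int(drop_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]   # k-th largest salience value
    out = image.copy()
    out[salience >= thresh] = 0.0         # remove the putatively causal evidence
    return out

rng = np.random.default_rng(2)
img = rng.random((8, 8))
sal = rng.random((8, 8))
intervened = mask_salient_pixels(img, sal, drop_frac=0.25)
```

A model whose prediction collapses under such an intervention was indeed relying on the masked, causally relevant pixels, which is the behavior causally informed training encourages.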
arXiv Detail & Related papers (2020-12-06T20:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all summaries) and is not responsible for any consequences of its use.