XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models
- URL: http://arxiv.org/abs/2602.07017v1
- Date: Sun, 01 Feb 2026 00:27:06 GMT
- Title: XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models
- Authors: Thuraya Alzubaidi, Sana Ammar, Maryam Alsharqi, Islem Rekik, Muzammil Behzad
- Abstract summary: XAI-CLIP is an ROI-guided perturbation framework for medical image segmentation. It integrates language-informed region localization with medical image segmentation and applies targeted, region-aware perturbations. XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union.
- Score: 4.5236257764997205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts, highlighting that the incorporation of multimodal vision-language representations into perturbation-based XAI frameworks significantly enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
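The paper does not include code, but the core mechanism the abstract describes, restricting occlusion perturbations to regions flagged by a language-guided localizer, can be sketched briefly. In this minimal NumPy sketch, `segment_fn` stands in for the segmentation model and `roi_mask` stands in for the CLIP-derived region localization; both names are hypothetical and the thresholding demo is an assumption, not the authors' implementation.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice similarity between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def roi_guided_occlusion(image, roi_mask, segment_fn, patch=8, fill=0.0):
    """Occlusion saliency restricted to patches that overlap the ROI.

    A plain occlusion sweep needs one forward pass per patch over the
    whole image; guiding the sweep with an ROI mask skips patches the
    localizer deems irrelevant, which is where the runtime saving comes
    from. Saliency for each occluded patch is the Dice drop of the
    segmentation relative to the unperturbed output.
    """
    base = segment_fn(image)              # reference segmentation
    h, w = image.shape
    saliency = np.zeros((h, w))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if not roi_mask[y:y + patch, x:x + patch].any():
                continue                  # outside ROI: no forward pass
            perturbed = image.copy()
            perturbed[y:y + patch, x:x + patch] = fill
            drop = 1.0 - dice(base, segment_fn(perturbed))
            saliency[y:y + patch, x:x + patch] = drop
    return saliency
```

For example, with a 32x32 image, an 8-pixel patch, and an ROI covering only the central quarter of the image, the sketch runs the model for 4 of the 16 candidate patches and skips the other 12, illustrating (in toy form) how ROI guidance cuts the number of forward passes.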
Related papers
- Uncertainty-Aware Vision-Language Segmentation for Medical Imaging [12.545486211087791]
We introduce a novel uncertainty-aware multimodal segmentation framework for medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks.
arXiv Detail & Related papers (2026-02-16T06:27:51Z) - Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction [72.80209358480424]
An overdose of iodinated contrast media (ICM) can cause kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images of normal-dose ICM from low-dose ICM, reducing the required dose. We propose a Structure-constrained Language-informed Diffusion Model (SLDM) that integrates structural synergy and spatial intelligence.
arXiv Detail & Related papers (2026-01-28T06:54:06Z) - Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs [20.507007953026346]
Anatomical Region-Guided Contrastive Decoding (ARCD) is a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our method is effective in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
arXiv Detail & Related papers (2025-12-19T03:11:20Z) - MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging [67.74482877175797]
MIRNet is a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. We introduce TongueAtlas-4K, a benchmark comprising 4,000 images annotated with 22 diagnostic labels.
arXiv Detail & Related papers (2025-11-13T06:30:41Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision [9.254163621425727]
DiSSECT is a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. It achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability.
arXiv Detail & Related papers (2025-09-23T07:58:21Z) - GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning [60.03671205298294]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability. This work first proposes a Region-Aware Multimodal Chain-of-Thought dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps.
arXiv Detail & Related papers (2025-06-22T08:09:58Z) - MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network [51.68708264694361]
Confounding factors, such as complex anatomical variations and imaging modality limitations, can affect medical images. We propose a multi-causal aware modeling backdoor-intervention optimization network for medical image segmentation. Our method significantly reduces the influence of these confounding factors, leading to enhanced segmentation accuracy.
arXiv Detail & Related papers (2025-05-28T01:40:10Z) - Federated Learning for Coronary Artery Plaque Detection in Atherosclerosis Using IVUS Imaging: A Multi-Hospital Collaboration [8.358846277772779]
Traditional interpretation of Intravascular Ultrasound (IVUS) images during Percutaneous Coronary Intervention (PCI) is time-intensive and inconsistent. A parallel 2D U-Net model with a multi-stage segmentation architecture has been developed to enable secure data analysis across institutions. With a Dice Similarity Coefficient (DSC) of 0.706, the model effectively identifies plaques and detects circular boundaries in real time.
arXiv Detail & Related papers (2024-12-19T13:06:28Z) - Augmentation is AUtO-Net: Augmentation-Driven Contrastive Multiview Learning for Medical Image Segmentation [3.1002416427168304]
This thesis focuses on retinal blood vessel segmentation tasks.
It provides an extensive literature review of deep learning-based medical image segmentation approaches.
It proposes a novel, efficient, and simple multiview learning framework.
arXiv Detail & Related papers (2023-11-02T06:31:08Z) - Explaining Clinical Decision Support Systems in Medical Imaging using Cycle-Consistent Activation Maximization [112.2628296775395]
Clinical decision support using deep neural networks has become a topic of steadily growing interest.
Clinicians are often hesitant to adopt the technology because its underlying decision-making process is considered opaque and difficult to comprehend.
We propose a novel decision explanation scheme based on CycleGAN activation maximization, which generates high-quality visualizations of classifier decisions even on smaller data sets.
arXiv Detail & Related papers (2020-10-09T14:39:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.