XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models
- URL: http://arxiv.org/abs/2602.07017v1
- Date: Sun, 01 Feb 2026 00:27:06 GMT
- Title: XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models
- Authors: Thuraya Alzubaidi, Sana Ammar, Maryam Alsharqi, Islem Rekik, Muzammil Behzad
- Abstract summary: XAI-CLIP is an ROI-guided perturbation framework for medical image segmentation. It integrates language-informed region localization with medical image segmentation and applies targeted, region-aware perturbations. XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union.
- Score: 4.5236257764997205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts, highlighting that the incorporation of multimodal vision-language representations into perturbation-based XAI frameworks significantly enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
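The paper does not include code, but the core mechanism the abstract describes, restricting occlusion perturbations to regions flagged by a language-guided localizer, can be sketched briefly. In this minimal NumPy sketch, `segment_fn` stands in for the segmentation model and `roi_mask` stands in for the CLIP-derived region localization; both names are hypothetical and the thresholding demo is an assumption, not the authors' implementation.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice similarity between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def roi_guided_occlusion(image, roi_mask, segment_fn, patch=8, fill=0.0):
    """Occlusion saliency restricted to patches that overlap the ROI.

    A plain occlusion sweep needs one forward pass per patch over the
    whole image; guiding the sweep with an ROI mask skips patches the
    localizer deems irrelevant, which is where the runtime saving comes
    from. Saliency for each occluded patch is the Dice drop of the
    segmentation relative to the unperturbed output.
    """
    base = segment_fn(image)              # reference segmentation
    h, w = image.shape
    saliency = np.zeros((h, w))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if not roi_mask[y:y + patch, x:x + patch].any():
                continue                  # outside ROI: no forward pass
            perturbed = image.copy()
            perturbed[y:y + patch, x:x + patch] = fill
            drop = 1.0 - dice(base, segment_fn(perturbed))
            saliency[y:y + patch, x:x + patch] = drop
    return saliency
```

For example, with a 32x32 image, an 8-pixel patch, and an ROI covering only the central quarter of the image, the sketch runs the model for 4 of the 16 candidate patches and skips the other 12, illustrating (in toy form) how ROI guidance cuts the number of forward passes.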
Related papers
- Uncertainty-Aware Vision-Language Segmentation for Medical Imaging [12.545486211087791]
We introduce a novel uncertainty-aware multimodal segmentation framework for medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks.
arXiv Detail & Related papers (2026-02-16T06:27:51Z) - Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction [72.80209358480424]
An overdose of iodinated contrast media (ICM) can cause kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images of normal-dose ICM from low-dose ICM, reducing the required dose. We propose a Structure-constrained Language-informed Diffusion Model (SLDM) that integrates structural synergy and spatial intelligence.
arXiv Detail & Related papers (2026-01-28T06:54:06Z) - Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs [20.507007953026346]
Anatomical Region-Guided Contrastive Decoding (ARCD) is a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our method is effective in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
arXiv Detail & Related papers (2025-12-19T03:11:20Z) - MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging [67.74482877175797]
MIRNet is a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. We introduce TongueAtlas-4K, a benchmark comprising 4,000 images annotated with 22 diagnostic labels.
arXiv Detail & Related papers (2025-11-13T06:30:41Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision [9.254163621425727]
DiSSECT is a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. It achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability.
arXiv Detail & Related papers (2025-09-23T07:58:21Z) - GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning [60.03671205298294]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability. This work first proposes a Region-Aware Multimodal Chain-of-Thought dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps.
arXiv Detail & Related papers (2025-06-22T08:09:58Z) - MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network [51.68708264694361]
Confounding factors, such as complex anatomical variations and imaging modality limitations, can affect medical images. We propose a multi-causal aware modeling backdoor-intervention optimization network for medical image segmentation. Our method significantly reduces the influence of these confounding factors, leading to enhanced segmentation accuracy.
arXiv Detail & Related papers (2025-05-28T01:40:10Z) - Federated Learning for Coronary Artery Plaque Detection in Atherosclerosis Using IVUS Imaging: A Multi-Hospital Collaboration [8.358846277772779]
Traditional interpretation of Intravascular Ultrasound (IVUS) images during Percutaneous Coronary Intervention (PCI) is time-intensive and inconsistent. A parallel 2D U-Net model with a multi-stage segmentation architecture has been developed to enable secure data analysis across institutions. With a Dice Similarity Coefficient (DSC) of 0.706, the model effectively identifies plaques and detects circular boundaries in real time.
arXiv Detail & Related papers (2024-12-19T13:06:28Z) - Augmentation is AUtO-Net: Augmentation-Driven Contrastive Multiview Learning for Medical Image Segmentation [3.1002416427168304]
This thesis focuses on retinal blood vessel segmentation tasks.
It provides an extensive literature review of deep learning-based medical image segmentation approaches.
It proposes a novel, efficient, and simple multiview learning framework.
arXiv Detail & Related papers (2023-11-02T06:31:08Z) - Explaining Clinical Decision Support Systems in Medical Imaging using Cycle-Consistent Activation Maximization [112.2628296775395]
Clinical decision support using deep neural networks has become a topic of steadily growing interest.
Clinicians are often hesitant to adopt the technology because its underlying decision-making process is considered opaque and difficult to comprehend.
We propose a novel decision explanation scheme based on CycleGAN activation maximization, which generates high-quality visualizations of classifier decisions even on smaller data sets.
arXiv Detail & Related papers (2020-10-09T14:39:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.