Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation
- URL: http://arxiv.org/abs/2602.15650v1
- Date: Tue, 17 Feb 2026 15:18:07 GMT
- Title: Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation
- Authors: Marco Salmè, Federico Siciliano, Fabrizio Silvestri, Paolo Soda, Rosa Sicilia, Valerio Guarrasi
- Abstract summary: Radiology Report Generation through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical adoption. Existing research treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency, while Retrieval-Augmented Generation (RAG) methods target factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.
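The abstract describes a pipeline that decomposes visual features into named clinical concepts and combines them with retrieved reports in an enriched prompt. The following is a minimal, hypothetical sketch of such a pipeline; the concept list, the cosine-similarity scoring, the threshold, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative concept vocabulary (assumed, not from the paper).
CONCEPTS = ["cardiomegaly", "pleural effusion", "edema", "consolidation"]

def concept_scores(image_embedding, concept_bank):
    """Project a visual embedding onto concept vectors via cosine similarity."""
    img = image_embedding / np.linalg.norm(image_embedding)
    bank = concept_bank / np.linalg.norm(concept_bank, axis=1, keepdims=True)
    return bank @ img

def retrieve_reports(image_embedding, report_embeddings, reports, k=2):
    """Return the k reports whose embeddings are most similar to the image."""
    sims = report_embeddings @ image_embedding
    top = np.argsort(sims)[::-1][:k]
    return [reports[i] for i in top]

def build_prompt(scores, retrieved, threshold=0.5):
    """Assemble an enriched prompt from active concepts and retrieved reports."""
    active = [c for c, s in zip(CONCEPTS, scores) if s >= threshold]
    lines = ["Detected concepts: " + (", ".join(active) or "none")]
    lines += [f"Similar report: {r}" for r in retrieved]
    lines.append("Generate the findings section of the radiology report.")
    return "\n".join(lines)
```

A generator VLM would then condition on `build_prompt(...)` together with the image, so that the named concepts make the visual evidence inspectable while the retrieved reports ground the wording.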
Related papers
- Interpretable Unsupervised Deformable Image Registration via Confidence-bound Multi-Hop Visual Reasoning [1.6939372704265414]
Unsupervised deformable image registration requires aligning complex anatomical structures without reference labels. Existing deep learning methods achieve considerable accuracy but often lack transparency, leading to error drift and reduced clinical trust. We propose a novel Multi-Hop Visual Chain of Reasoning framework that reformulates registration as a progressive reasoning process.
arXiv Detail & Related papers (2026-01-30T14:41:19Z) - AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps, including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z) - Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting [37.57009831483529]
Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation. Our framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels.
arXiv Detail & Related papers (2026-01-06T14:17:44Z) - MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
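The routing idea summarized above, using similarity to send a query to a specialized expert, can be sketched as follows. The centroid-based routing rule and all names here are illustrative assumptions, not the paper's RA-MoE implementation.

```python
import numpy as np

def route_query(query_embedding, expert_centroids):
    """Pick the expert whose centroid is most cosine-similar to the query.

    query_embedding: shape (d,) combined image/text embedding (assumed).
    expert_centroids: shape (n_experts, d), one centroid per expert.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    c = expert_centroids / np.linalg.norm(expert_centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))
```

The selected index would map to one expert LVLM, whose context is then augmented with retrieved material before answering.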
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss, and a dual transformer decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - Interpretable Clinical Classification with Kolmogorov-Arnold Networks [70.72819760172744]
Kolmogorov-Arnold Networks (KANs) offer intrinsic interpretability through transparent, symbolic representations. KANs support built-in patient-level insights, intuitive visualizations, and nearest-patient retrieval. These results position KANs as a promising step toward trustworthy AI that clinicians can understand, audit, and act upon.
arXiv Detail & Related papers (2025-09-20T17:21:58Z) - Towards Interpretable Renal Health Decline Forecasting via Multi-LMM Collaborative Reasoning Framework [12.732588046754783]
We propose a collaborative framework that enhances the performance of open-source LMMs for eGFR forecasting. It incorporates visual knowledge transfer, abductive reasoning, and a short-term memory mechanism to enhance prediction accuracy and interpretability. Our method sheds new light on building AI systems for healthcare that combine predictive accuracy with clinically grounded interpretability.
arXiv Detail & Related papers (2025-07-30T08:11:06Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision-Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings [44.99833362998488]
This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric language models for chest X-ray interpretation. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities.
arXiv Detail & Related papers (2025-05-03T06:18:12Z) - CBM-RAG: Demonstrating Enhanced Interpretability in Radiology Report Generation with Multi-Agent RAG and Concept Bottleneck Models [1.7042756021131187]
This paper presents an automated radiology report generation framework that combines Concept Bottleneck Models (CBMs) with a Multi-Agent Retrieval-Augmented Generation (RAG) system. CBMs map chest X-ray features to human-understandable clinical concepts, enabling transparent disease classification. The RAG system integrates multi-agent collaboration and external knowledge to produce contextually rich, evidence-based reports.
arXiv Detail & Related papers (2025-04-29T16:14:55Z)
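The concept-bottleneck idea summarized above, mapping image features to named concepts and predicting only from those concepts, can be sketched as follows. The concept names, sigmoid activations, and linear heads are illustrative assumptions, not the CBM-RAG architecture.

```python
import numpy as np

# Illustrative concept names (assumed, not from the paper).
CONCEPTS = ["opacity", "cardiomegaly", "effusion"]

def bottleneck(features, W_c):
    """Map image features to per-concept activations in [0, 1] (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-(W_c @ features)))

def classify(concept_acts, W_y):
    """Predict disease logits from concept activations only.

    Because the classifier sees nothing but named concepts, each prediction
    can be traced back to which concepts drove it.
    """
    return W_y @ concept_acts
```

The transparency comes from the interface: inspecting `bottleneck(...)` directly tells a reader which clinical concepts the downstream prediction relied on.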
This list is automatically generated from the titles and abstracts of the papers on this site.