Making medical vision-language models think causally across modalities with retrieval-augmented cross-modal reasoning
- URL: http://arxiv.org/abs/2601.18356v1
- Date: Mon, 26 Jan 2026 11:03:00 GMT
- Title: Making medical vision-language models think causally across modalities with retrieval-augmented cross-modal reasoning
- Authors: Weiqin Yang, Haowen Xue, Qingyi Peng, Hexuan Hu, Qian Huang, Tingbo Zhang,
- Abstract summary: Medical vision-language models (VLMs) achieve strong performance in diagnostic reporting and image-text alignment.<n>Their underlying reasoning mechanisms remain fundamentally correlational, exhibiting reliance on superficial statistical associations.<n>We propose Multimodal Causal Retrieval-Augmented Generation, a framework that integrates causal inference principles with multimodal retrieval.
- Score: 16.243806723551454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical vision-language models (VLMs) achieve strong performance in diagnostic reporting and image-text alignment, yet their underlying reasoning mechanisms remain fundamentally correlational, exhibiting reliance on superficial statistical associations that fail to capture the causal pathophysiological mechanisms central to clinical decision-making. This limitation makes them fragile, prone to hallucinations, and sensitive to dataset biases. Retrieval-augmented generation (RAG) offers a partial remedy by grounding predictions in external knowledge. However, conventional RAG depends on semantic similarity, introducing new spurious correlations. We propose Multimodal Causal Retrieval-Augmented Generation, a framework that integrates causal inference principles with multimodal retrieval. It retrieves clinically relevant exemplars and causal graphs from external sources, conditioning model reasoning on counterfactual and interventional evidence rather than correlations alone. Applied to radiology report generation, diagnosis prediction, and visual question answering, it improves factual accuracy, robustness to distribution shifts, and interpretability. Our results highlight causal retrieval as a scalable path toward medical VLMs that think beyond pattern matching, enabling trustworthy multimodal reasoning in high-stakes clinical settings.
Related papers
- Causal Graph Neural Networks for Healthcare [2.446787923076599]
Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability.<n>This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs.
arXiv Detail & Related papers (2025-11-04T12:34:46Z) - MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA)<n>We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context.<n>We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - Cross-modal Causal Intervention for Alzheimer's Disease Prediction [13.584994367762398]
We propose a visual-language causality-inspired framework named Cross-modal Causal Intervention with Mediator for Alzheimer's Disease Diagnosis (MediAD)<n>Our framework implicitly mitigates the effect of both observable and unobservable confounders through a unified causal intervention method.
arXiv Detail & Related papers (2025-07-18T14:21:24Z) - Predictive Causal Inference via Spatio-Temporal Modeling and Penalized Empirical Likelihood [0.0]
This study introduces an integrated framework for predictive causal inference designed to overcome limitations in conventional single model approaches.<n> Specifically, we combine a Hidden Markov Model for spatial health state estimation with a Multi Task and Multi Graph Convolutional Network (MTGCN) for capturing temporal outcome trajectories.<n>To demonstrate its utility, we focus on clinical domains such as cancer, dementia, Parkinson disease, where treatment effects are challenging to observe directly.
arXiv Detail & Related papers (2025-07-11T03:11:15Z) - Causal Disentanglement for Robust Long-tail Medical Image Generation [80.15257897500578]
We propose a novel medical image generation framework, which generates independent pathological and structural features.<n>We leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images.
arXiv Detail & Related papers (2025-04-20T01:54:18Z) - Quantifying Symptom Causality in Clinical Decision Making: An Exploration Using CausaLM [0.0]
Current machine learning approaches to medical diagnosis often rely on correlational patterns between symptoms and diseases.<n>In this work, we move beyond correlation to investigate the causal influence of key symptoms-specifically "chest pain" on diagnostic predictions.
arXiv Detail & Related papers (2025-03-25T06:59:21Z) - Deep State-Space Generative Model For Correlated Time-to-Event Predictions [54.3637600983898]
We propose a deep latent state-space generative model to capture the interactions among different types of correlated clinical events.
Our method also uncovers meaningful insights about the latent correlations among mortality and different types of organ failures.
arXiv Detail & Related papers (2024-07-28T02:42:36Z) - Rethinking Radiology Report Generation via Causal Inspired Counterfactual Augmentation [11.266364967223556]
Radiology Report Generation (RRG) draws attention as a vision-and-language interaction of biomedical fields.
Previous works inherited the ideology of traditional language generation tasks, aiming to generate paragraphs with high readability as reports.
Despite significant progress, the independence between diseases-a specific property of RRG-was neglected, yielding the models being confused by the co-occurrence of diseases brought on by the biased data distribution.
arXiv Detail & Related papers (2023-11-22T10:55:36Z) - Identifiable Latent Polynomial Causal Models Through the Lens of Change [82.14087963690561]
Causal representation learning aims to unveil latent high-level causal representations from observed low-level data.<n>One of its primary tasks is to provide reliable assurance of identifying these latent causal models, known as identifiability.
arXiv Detail & Related papers (2023-10-24T07:46:10Z) - Cross-Modal Causal Intervention for Medical Report Generation [107.76649943399168]
Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance.<n> generating accurate lesion descriptions remains challenging due to spurious correlations from visual-linguistic biases.<n>We propose a two-stage framework named CrossModal Causal Representation Learning (CMCRL)<n> Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-16T07:23:55Z) - Bayesian Networks for the robust and unbiased prediction of depression
and its symptoms utilizing speech and multimodal data [65.28160163774274]
We apply a Bayesian framework to capture the relationships between depression, depression symptoms, and features derived from speech, facial expression and cognitive game data collected at thymia.
arXiv Detail & Related papers (2022-11-09T14:48:13Z) - ACRE: Abstract Causal REasoning Beyond Covariation [90.99059920286484]
We introduce the Abstract Causal REasoning dataset for systematic evaluation of current vision systems in causal induction.
Motivated by the stream of research on causal discovery in Blicket experiments, we query a visual reasoning system with the following four types of questions in either an independent scenario or an interventional scenario.
We notice that pure neural models tend towards an associative strategy under their chance-level performance, whereas neuro-symbolic combinations struggle in backward-blocking reasoning.
arXiv Detail & Related papers (2021-03-26T02:42:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.