Structure Causal Models and LLMs Integration in Medical Visual Question Answering
- URL: http://arxiv.org/abs/2505.02703v1
- Date: Mon, 05 May 2025 14:57:02 GMT
- Title: Structure Causal Models and LLMs Integration in Medical Visual Question Answering
- Authors: Zibo Xu, Qiang Li, Weizhi Nie, Weijie Wang, Anan Liu
- Abstract summary: We propose a causal inference framework for the MedVQA task. We are the first to introduce a novel causal graph structure that represents the interaction between visual and textual elements. Our method achieves true causal correlations in the face of complex medical data.
- Score: 42.54219413108453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical Visual Question Answering (MedVQA) aims to answer medical questions according to medical images. However, the complexity of medical data leads to confounders that are difficult to observe, so bias between images and questions is inevitable. Such cross-modal bias makes it challenging to infer medically meaningful answers. In this work, we propose a causal inference framework for the MedVQA task, which effectively eliminates the relative confounding effect between the image and the question to ensure the precision of the question-answering (QA) session. We are the first to introduce a novel causal graph structure that represents the interaction between visual and textual elements, explicitly capturing how different questions influence visual features. During optimization, we apply mutual information to discover spurious correlations and propose a multi-variable resampling front-door adjustment method to eliminate the relative confounding effect, which aims to align features based on their true causal relevance to the question-answering task. In addition, we introduce a prompt strategy that combines multiple prompt forms to improve the model's ability to understand complex medical data and answer accurately. Extensive experiments on three MedVQA datasets demonstrate that 1) our method significantly improves the accuracy of MedVQA, and 2) our method achieves true causal correlations in the face of complex medical data.
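For orientation, the classical single-mediator front-door adjustment that front-door methods build on can be written as below. This is only the textbook formula (with X the image-question input, M the mediating features, and Y the answer); the paper's multi-variable resampling variant is not reproduced here.

```latex
% Textbook front-door adjustment (Pearl), shown for a single mediator M.
% X: image--question input, M: mediating features, Y: answer (illustrative symbols only).
P(Y \mid \mathrm{do}(X{=}x)) \;=\; \sum_{m} P(M{=}m \mid X{=}x)
    \sum_{x'} P(Y \mid X{=}x', M{=}m)\, P(X{=}x')
```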
Related papers
- GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning [50.94508930739623]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. This work first proposes a Thinking with Visual Grounding dataset wherein the answer generation is decomposed into intermediate reasoning steps. We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
arXiv Detail & Related papers (2025-06-22T08:09:58Z) - MedCFVQA: A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering [13.506155313741493]
Existing MedVQA models suffer from modality preference bias, where predictions are heavily dominated by one modality while overlooking the other. We propose a Medical CounterFactual VQA (MedCFVQA) model, which trains with bias and leverages causal graphs to eliminate the modality preference bias during inference. We show that MedCFVQA significantly outperforms its non-causal counterpart on the SLAKE and RadVQA datasets as well as their SLAKE-CP and RadVQA-CP counterparts.
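As a rough illustration of counterfactual debiasing in general (not MedCFVQA's exact inference rule), one common recipe subtracts the logits of a single-modality bias branch from the fused prediction at test time. A minimal PyTorch-style sketch, with all names hypothetical:

```python
import torch

def debias_answer_logits(fused_logits, question_only_logits):
    """Illustrative counterfactual-style debiasing (assumption, not the
    paper's method): subtract the direct effect estimated by a
    question-only branch from the full image+question prediction, so
    answers driven purely by modality preference are down-weighted."""
    debiased = fused_logits - question_only_logits
    return torch.log_softmax(debiased, dim=-1)
```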
arXiv Detail & Related papers (2025-05-22T04:21:05Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and ImageCLEF 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - Medical Question Summarization with Entity-driven Contrastive Learning [12.008269098530386]
This paper proposes a novel medical question summarization framework using entity-driven contrastive learning (ECL).
ECL employs medical entities in frequently asked questions (FAQs) as focuses and devises an effective mechanism to generate hard negative samples.
We find that some MQA datasets suffer from serious data leakage problems, such as the iCliniq dataset's 33% duplicate rate.
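For intuition about the general recipe (not ECL's exact mechanism), a contrastive objective with hard negatives can be written as an InfoNCE-style loss, where the hard negatives might be encodings of summaries whose medical entities were swapped. A minimal PyTorch-style sketch, all names hypothetical:

```python
import torch
import torch.nn.functional as F

def entity_contrastive_loss(anchor, positive, hard_negatives, temperature=0.1):
    """InfoNCE-style loss sketch: anchor (B, D), positive (B, D),
    hard_negatives (B, K, D). The positive sits at index 0 of the
    candidate list; everything else is treated as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    candidates = torch.cat([positive.unsqueeze(1), hard_negatives], dim=1)
    candidates = F.normalize(candidates, dim=-1)
    logits = torch.einsum('bd,bkd->bk', anchor, candidates) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positives at index 0
    return F.cross_entropy(logits, labels)
```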
arXiv Detail & Related papers (2023-04-15T00:19:03Z) - Medical visual question answering using joint self-supervised learning [8.817054025763325]
The encoder embeds the image-text dual modalities with a self-attention mechanism.
The decoder is connected to the top of the encoder and fine-tuned using the small-sized medical VQA dataset.
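As a generic illustration of such a joint encoder-decoder setup (an assumption-laden sketch, not the paper's architecture), image patch embeddings and question token embeddings can be concatenated and passed through a shared self-attention encoder, with a small answer head fine-tuned on the MedVQA data:

```python
import torch
import torch.nn as nn

class JointVQAModel(nn.Module):
    """Hypothetical joint self-attention encoder with a linear answer
    decoder on top; dimensions and layer counts are placeholders."""
    def __init__(self, dim=256, num_layers=4, num_heads=8, num_answers=500):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.decoder = nn.Linear(dim, num_answers)  # fine-tuned answer head

    def forward(self, image_tokens, text_tokens):
        # (B, N_img, D) and (B, N_txt, D) -> one cross-modal sequence.
        tokens = torch.cat([image_tokens, text_tokens], dim=1)
        fused = self.encoder(tokens)
        pooled = fused.mean(dim=1)  # simple mean pooling over tokens
        return self.decoder(pooled)
```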
arXiv Detail & Related papers (2023-02-25T12:12:22Z) - Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning [45.746882253686856]
Medical visual question answering (VQA) aims to answer clinically relevant questions regarding input medical images.
We first collected a comprehensive and large-scale medical VQA dataset, focusing on chest X-ray images.
Based on this dataset, we also propose a novel baseline method by constructing three different relationship graphs.
arXiv Detail & Related papers (2023-02-19T17:46:16Z) - Consistency-preserving Visual Question Answering in Medical Imaging [2.005299372367689]
Visual Question Answering (VQA) models take an image and a natural-language question as input and infer the answer to the question.
We propose a novel loss function and corresponding training procedure that allows the inclusion of relations between questions into the training process.
Our experiments show that our method outperforms state-of-the-art baselines.
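For intuition only, one generic way to encode relations between questions during training is a consistency term that ties the answer distributions of linked question pairs together; a hedged PyTorch-style sketch (not the paper's actual loss), with all tensor names hypothetical:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_main, logits_sub, linked_mask, weight=1.0):
    """Symmetric-KL consistency term: for pairs flagged by linked_mask
    (shape (B,)), pull the answer distributions of a main question and
    its related sub-question toward each other."""
    log_p_main = F.log_softmax(logits_main, dim=-1)
    log_p_sub = F.log_softmax(logits_sub, dim=-1)
    kl = F.kl_div(log_p_sub, log_p_main.exp(), reduction='none').sum(-1) \
       + F.kl_div(log_p_main, log_p_sub.exp(), reduction='none').sum(-1)
    return weight * (linked_mask * kl).mean()
```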
arXiv Detail & Related papers (2022-06-27T13:38:50Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
It not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention module to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z) - MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of the text description.
Experiments on two datasets show that MuVAM surpasses the state-of-the-art method.
arXiv Detail & Related papers (2021-07-07T13:40:25Z) - Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering [89.76059961309453]
The HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization exam.
These questions are the most challenging for current QA systems.
We present a Multi-step reasoning with Knowledge extraction framework (MurKe), which strives to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z)