Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports
- URL: http://arxiv.org/abs/2505.16624v1
- Date: Thu, 22 May 2025 12:57:35 GMT
- Title: Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports
- Authors: Francesco Dalla Serra, Patrick Schrempf, Chaoyang Wang, Zaiqiao Meng, Fani Deligianni, Alison Q. O'Neil
- Abstract summary: We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA). Single-image questions focus on abnormalities within a specific CXR, while image-difference questions compare two longitudinal CXRs acquired at different time points. We extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers.
- Score: 19.320173724978815
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence for the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
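The two-step RG then AG design described above lends itself to a simple pipeline: first predict a report from the CXR(s), then feed that report back in as textual evidence when generating the answer. Below is a minimal sketch of how such a pipeline could be wired together; the class names and placeholder outputs are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the two-step pipeline described in the abstract:
# i) Report Generation (RG), then ii) Answer Generation (AG) grounded on the
# predicted report. The classes and their outputs are hypothetical placeholders.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class VQAExample:
    current_image: Any            # current CXR (e.g. a tensor or PIL image)
    prior_image: Optional[Any]    # prior CXR for image-difference questions
    question: str

class ReportGenerator:
    """Step i): auto-regressively predict a radiology report for the CXR(s)."""
    def generate(self, current_image, prior_image=None) -> str:
        # A real model would decode a report conditioned on the image(s).
        return "Cardiomegaly is unchanged. No new consolidation."

class AnswerGenerator:
    """Step ii): answer the question, grounded on the predicted report."""
    def generate(self, current_image, prior_image, question: str, report: str) -> str:
        # The predicted report is supplied as additional textual evidence,
        # chain-of-thought style, alongside the image(s) and the question.
        prompt = f"Report: {report}\nQuestion: {question}\nAnswer:"
        # A real model would condition on the image(s) plus this prompt.
        return "No new abnormalities compared with the prior study."

def answer_cxr_question(example: VQAExample) -> str:
    report = ReportGenerator().generate(example.current_image, example.prior_image)
    return AnswerGenerator().generate(
        example.current_image, example.prior_image, example.question, report
    )

print(answer_cxr_question(VQAExample(None, None, "What abnormalities are seen?")))
```

Splitting the model this way means the AG step never has to reason from pixels alone: the intermediate report acts as an explicit, human-readable rationale.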
Related papers
- Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning [18.15610003617933]
We present CXRTrek, a new multi-stage visual question answering (VQA) dataset for chest X-ray (CXR) interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings. We propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the framework.
arXiv Detail & Related papers (2025-05-29T06:30:40Z)
- RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [64.66825253356869]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities. We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for chest X-rays and CTRATE-IR for CT scans. We develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
arXiv Detail & Related papers (2025-03-06T17:43:03Z)
- Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays [6.351190845487287]
Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions based on differences between a pair of images. Previous works focused on designing specific network architectures for the diff-VQA task, missing opportunities to enhance the model's performance. Here, we introduce a novel VLM called PLURAL, which is pretrained on natural and longitudinal chest X-ray data for the diff-VQA task.
arXiv Detail & Related papers (2024-02-14T06:20:48Z)
- Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering [45.058569118999436]
Given a pair of main and reference images, this task attempts to answer several questions about the diseases present in each image and the differences between them.
We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images.
arXiv Detail & Related papers (2023-07-22T05:34:18Z)
- Instrumental Variable Learning for Chest X-ray Classification [52.68170685918908]
We propose an interpretable instrumental variable (IV) learning framework to eliminate the spurious association and obtain accurate causal representation.
Our approach's performance is demonstrated using the MIMIC-CXR, NIH ChestX-ray 14, and CheXpert datasets.
arXiv Detail & Related papers (2023-05-20T03:12:23Z)
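The summary above gives no implementation detail, so as a generic illustration only, here is the core instrumental-variable idea via two-stage least squares on synthetic data. This is not the paper's framework: the instrument Z influences the outcome Y only through the treatment X, which lets us recover the causal effect despite an unobserved confounder.

```python
# Generic two-stage least squares (2SLS) illustration of IV estimation.
# NOT the paper's framework; it only shows why an instrument removes
# the bias introduced by an unobserved confounder U.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
Z = rng.normal(size=n)                        # instrument, independent of U
U = rng.normal(size=n)                        # unobserved confounder
X = 0.8 * Z + U + rng.normal(size=n)          # treatment, confounded by U
Y = 2.0 * X + 3.0 * U + rng.normal(size=n)    # outcome; true effect of X is 2.0

# Naive OLS of Y on X is biased because X and Y share the confounder U.
ols = (X @ Y) / (X @ X)                       # ~3.1, biased upward

# Stage 1: project X onto Z. Stage 2: regress Y on the projected X.
x_hat = Z * ((Z @ X) / (Z @ Z))
iv = (x_hat @ Y) / (x_hat @ x_hat)            # ~2.0, recovers the causal effect

print(f"naive OLS estimate: {ols:.2f} (biased)")
print(f"IV (2SLS) estimate: {iv:.2f} (expected ~2.0)")
```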
- Cross-Modal Causal Intervention for Medical Report Generation [107.76649943399168]
Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance. However, generating accurate lesion descriptions remains challenging due to spurious correlations from visual-linguistic biases. We propose a two-stage framework named Cross-Modal Causal Representation Learning (CMCRL). Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
- DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis [54.93879264615525]
We propose DeltaNet to generate medical reports automatically.
DeltaNet employs three steps to generate a report.
We evaluate DeltaNet on a COVID-19 dataset, where DeltaNet outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-12T07:41:03Z)
- CheXRelNet: An Anatomy-Aware Model for Tracking Longitudinal Relationships between Chest X-Rays [2.9212099078191764]
We propose CheXRelNet, a neural model that can track longitudinal pathology change relations between two Chest X-rays.
CheXRelNet incorporates local and global visual features, utilizes inter-image and intra-image anatomical information, and learns dependencies between anatomical region attributes.
arXiv Detail & Related papers (2022-08-08T02:22:09Z)
- Contrastive Attention for Automatic Chest X-ray Report Generation [124.60087367316531]
In most cases, the normal regions dominate the entire chest X-ray image, and the corresponding descriptions of these normal regions dominate the final report.
We propose the Contrastive Attention (CA) model, which compares the current input image with normal images to distill the contrastive information. We achieve state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-06-13T11:20:31Z)
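As a rough sketch of the contrastive-attention idea summarized above, under our own assumptions rather than the authors' architecture: attend from each region of the current image over a pool of normal-image features, reconstruct the closest normal appearance, and keep the residual as the contrastive (abnormal) signal.

```python
# Rough sketch of contrastive attention: reconstruct each image region from a
# pool of "normal" features and keep the residual. Illustrative only; this is
# not the CA model's actual architecture.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_features(current, normal_pool):
    """current: (R, D) region features; normal_pool: (N, D) normal-image features."""
    attn = softmax(current @ normal_pool.T / np.sqrt(current.shape[1]))  # (R, N)
    normal_like = attn @ normal_pool   # what each region would look like if normal
    return current - normal_like       # residual highlights abnormal content

rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 256))   # e.g. a 7x7 feature map, 256-dim
normals = rng.normal(size=(100, 256))  # features pooled from normal CXRs
print(contrastive_features(regions, normals).shape)  # (49, 256)
```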
- Covid-19 Detection from Chest X-ray and Patient Metadata using Graph Convolutional Neural Networks [6.420262246029286]
We propose a novel Graph Convolutional Neural Network (GCN) that is capable of identifying bio-markers of Covid-19 pneumonia. The proposed method exploits important relational knowledge between data instances and their features using a graph representation and applies convolution to learn from the graph data.
arXiv Detail & Related papers (2021-05-20T13:13:29Z)
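For readers unfamiliar with graph convolution, here is a minimal, self-contained layer in the standard Kipf & Welling form, H' = ReLU(D^(-1/2)(A+I)D^(-1/2) H W). Treating data instances (patients/images plus metadata) as nodes is our assumption about how such a model is applied, not detail taken from the paper.

```python
# Minimal graph-convolution layer (Kipf & Welling normalization).
# Nodes would be data instances linked by metadata similarity (our assumption).
import numpy as np

def gcn_layer(A, H, W):
    """A: (n, n) adjacency, H: (n, d_in) node features, W: (d_in, d_out) weights."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 Â D^-1/2
    return np.maximum(A_norm @ H @ W, 0.0)            # aggregate, transform, ReLU

rng = np.random.default_rng(0)
A = (rng.random((8, 8)) > 0.7).astype(float)
np.fill_diagonal(A, 0.0)
A = np.maximum(A, A.T)                 # symmetric graph over 8 instances
H = rng.normal(size=(8, 16))           # e.g. image + metadata features per node
W = rng.normal(size=(16, 4))
print(gcn_layer(A, H, W).shape)        # (8, 4)
```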
- Variational Knowledge Distillation for Disease Classification in Chest X-Rays [102.04931207504173]
We propose variational knowledge distillation (VKD), a new probabilistic inference framework for disease classification based on X-rays.
We demonstrate the effectiveness of our method on three public benchmark datasets with paired X-ray images and EHRs.
arXiv Detail & Related papers (2021-03-19T14:13:56Z)
- Convolutional-LSTM for Multi-Image to Single Output Medical Prediction [55.41644538483948]
A common scenario in developing countries is that volume metadata is lost for multiple reasons. It is possible to build a multi-image-to-single-output diagnostic model that mimics a human doctor's diagnostic process.
arXiv Detail & Related papers (2020-10-20T04:30:09Z)
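As a hedged sketch of the multi-image-to-single-output pattern from the entry above: a per-image CNN encoder followed by an LSTM, with the final hidden state producing one prediction per study. This uses a plain LSTM over encoded frames as a simpler stand-in for a true convolutional-LSTM cell, and all sizes are illustrative, not taken from the paper.

```python
# Sketch: encode each image in a study with a small CNN, aggregate the sequence
# with an LSTM, and emit a single diagnostic prediction. Illustrative only.
import torch
import torch.nn as nn

class MultiImageToSingleOutput(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(               # per-image feature extractor
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 32)
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)        # single output for the study

    def forward(self, images):                      # images: (B, T, 1, H, W)
        b, t = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1)).view(b, t, -1)  # (B, T, 32)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                   # one prediction per sequence

model = MultiImageToSingleOutput()
study = torch.randn(4, 5, 1, 64, 64)                # 4 studies, 5 images each
print(model(study).shape)                           # torch.Size([4, 2])
```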