Expert Knowledge-Aware Image Difference Graph Representation Learning
for Difference-Aware Medical Visual Question Answering
- URL: http://arxiv.org/abs/2307.11986v1
- Date: Sat, 22 Jul 2023 05:34:18 GMT
- Title: Expert Knowledge-Aware Image Difference Graph Representation Learning
for Difference-Aware Medical Visual Question Answering
- Authors: Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma
Kobayashi, Tatsuya Harada, Ronald M. Summers, Yingying Zhu
- Abstract summary: Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them.
We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images.
- Score: 44.897116657726365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To contribute to automating medical vision-language models, we propose a novel Chest X-ray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them. This is consistent with radiologists' diagnostic practice of comparing the current image with the reference before concluding the report. We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images. Compared to existing medical VQA datasets, our questions are tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this task. The proposed baseline model leverages expert knowledge, such as anatomical structure priors and semantic and spatial knowledge, to construct a multi-relationship graph representing the differences between the two images for the image-difference VQA task. The dataset and code can be found at https://github.com/Holipori/MIMIC-Diff-VQA. We believe this work will further push forward medical vision-language models.
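The baseline's graph construction lends itself to a small illustration. The sketch below is a hypothetical reading of the abstract, not the authors' released code: the `RegionNode` class, `build_multi_relationship_graph`, the IoU threshold of 0.1, the region names, and the toy knowledge base are all invented for this example. It assumes the three knowledge sources map to three edge types: anatomical (the same named structure across the main/reference pair), spatial (overlapping regions within one image), and semantic (findings linked in an expert knowledge base).

```python
# Hypothetical sketch of a multi-relationship graph for image-difference VQA.
# All names, coordinates, and the disease vocabulary are invented.
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class RegionNode:
    name: str                                     # anatomical region, e.g. "left lung"
    bbox: tuple                                   # (x1, y1, x2, y2) in image coordinates
    findings: set = field(default_factory=set)    # findings detected in this region

def iou(a, b):
    """Intersection-over-union of two boxes; drives the spatial relation."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def build_multi_relationship_graph(main_regions, ref_regions, knowledge_base):
    """Return edges of three types over nodes from a main/reference image pair:
    - "anatomical": the same named structure across the two images,
    - "spatial": overlapping regions within one image,
    - "semantic": regions whose findings co-occur in the expert knowledge base.
    """
    nodes = [("main", r) for r in main_regions] + [("ref", r) for r in ref_regions]
    edges = []
    for (img_a, a), (img_b, b) in combinations(nodes, 2):
        if img_a != img_b and a.name == b.name:
            edges.append((a.name + "@" + img_a, b.name + "@" + img_b, "anatomical"))
        if img_a == img_b and iou(a.bbox, b.bbox) > 0.1:
            edges.append((a.name + "@" + img_a, b.name + "@" + img_b, "spatial"))
        if any((fa, fb) in knowledge_base for fa in a.findings for fb in b.findings):
            edges.append((a.name + "@" + img_a, b.name + "@" + img_b, "semantic"))
    return edges

# Toy usage: one overlap within the main image, one cross-image structure,
# and one invented knowledge-base co-occurrence pair.
kb = {("edema", "effusion")}
main = [RegionNode("left lung", (0, 0, 50, 80), {"edema"}),
        RegionNode("heart", (30, 20, 90, 90), set())]
ref = [RegionNode("left lung", (0, 0, 50, 80), {"effusion"})]
print(build_multi_relationship_graph(main, ref, kb))
```

Running the toy example prints a spatial edge within the main image plus anatomical and semantic edges linking the two "left lung" nodes across the pair; cross-image links of that kind are what difference questions would reason over.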
Related papers
- Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays [6.351190845487287]
Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions based on differences between a pair of images.
Previous works focused on designing specific network architectures for the diff-VQA task, missing opportunities to enhance performance using a pretrained vision-language model.
Here, we introduce a novel VLM called PLURAL, which is pretrained on natural and longitudinal chest X-ray data for the diff-VQA task.
arXiv Detail & Related papers (2024-02-14T06:20:48Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [35.64805788623848]
We focus on the problem of Medical Visual Question Answering (MedVQA).
We propose a generative model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model (a minimal sketch of this alignment idea follows the list below).
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- Pixel-Level Explanation of Multiple Instance Learning Models in Biomedical Single Cell Images [52.527733226555206]
We investigate the use of four attribution methods to explain multiple instance learning models.
We study two datasets of acute myeloid leukemia with over 100,000 single-cell images.
We compare attribution maps with the annotations of a medical expert to see how the model's decision-making differs from the human standard.
arXiv Detail & Related papers (2023-03-15T14:00:11Z)
- Medical visual question answering using joint self-supervised learning [8.817054025763325]
The encoder embeds the image and text modalities jointly using a self-attention mechanism.
The decoder is connected to the top of the encoder and fine-tuned using the small-sized medical VQA dataset.
arXiv Detail & Related papers (2023-02-25T12:12:22Z)
- Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning [45.746882253686856]
Medical visual question answering (VQA) aims to answer clinically relevant questions regarding input medical images.
We first collected a comprehensive and large-scale medical VQA dataset, focusing on chest X-ray images.
Based on this dataset, we also propose a novel baseline method by constructing three different relationship graphs.
arXiv Detail & Related papers (2023-02-19T17:46:16Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of the text description.
Experiments on two datasets show that MuVAM surpasses state-of-the-art methods.
arXiv Detail & Related papers (2021-07-07T13:40:25Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
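As promised in the PMC-VQA entry above, here is a minimal sketch of the vision-encoder-to-LLM alignment idea it describes. This is an assumption-laden illustration rather than PMC-VQA's actual architecture: the class name `VisionLanguageAligner`, the single linear projector, and the feature dimensions (768 for a ViT-style encoder, 4096 for the LLM) are all invented for this example.

```python
# Hypothetical sketch of aligning pre-trained vision features with an LLM.
# Module names and dimensions are assumptions, not PMC-VQA's released code.
import torch
import torch.nn as nn

class VisionLanguageAligner(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=4096):
        super().__init__()
        # Projects frozen vision-encoder features into the LLM's token space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, num_patches, vision_dim) from a pre-trained
        # vision encoder; text_embeddings: (batch, seq_len, llm_dim) from the
        # LLM's embedding table. The projected patches are prepended as a
        # visual prefix, so the LLM attends to them like ordinary tokens.
        visual_prefix = self.projector(image_features)
        return torch.cat([visual_prefix, text_embeddings], dim=1)

# Toy shapes only; a real setup would feed the result to the LLM decoder.
aligner = VisionLanguageAligner()
img = torch.randn(2, 196, 768)      # e.g. ViT patch features
txt = torch.randn(2, 32, 4096)      # embedded question tokens
print(aligner(img, txt).shape)      # torch.Size([2, 228, 4096])
```

Prepending the projected patch features as a visual prefix lets the language model attend to image evidence like ordinary tokens; real systems typically use deeper projectors and instruction-formatted prompts, but the alignment step itself looks much like this.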