Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder
- URL: http://arxiv.org/abs/2304.01611v2
- Date: Mon, 5 Jun 2023 05:17:55 GMT
- Title: Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder
- Authors: Yunyi Liu, Zhanyu Wang, Dong Xu, and Luping Zhou
- Abstract summary: We propose a new Transformer-based framework for medical VQA, named Q2ATransformer.
We introduce an additional Transformer decoder with a set of learnable candidate answer embeddings to query the existence of each answer class for a given image-question pair.
Our method achieves new state-of-the-art performance on two medical VQA benchmarks.
- Score: 39.06513668037645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (VQA) systems play a supporting role in
understanding clinically relevant information carried by medical images. Questions
about a medical image fall into two categories: closed-end (such as yes/no
questions) and open-end. To obtain answers, the majority of existing medical VQA
methods rely on classification approaches, while a few works attempt to use
generation approaches or a mixture of the two. The classification approaches
are relatively simple but perform poorly on long open-end questions. To bridge
this gap, in this paper we propose a new Transformer-based framework for
medical VQA, named Q2ATransformer, which integrates the advantages of both
the classification and the generation approaches and provides a unified
treatment of closed-end and open-end questions. Specifically, we introduce
an additional Transformer decoder with a set of learnable candidate answer
embeddings to query the existence of each answer class for a given
image-question pair. Through the Transformer attention, the candidate answer
embeddings interact with the fused features of the image-question pair to make
the decision. In this way, despite being a classification-based approach, our
method provides a mechanism to interact with the answer information during
prediction, as generation-based approaches do. On the other hand, through
classification, we mitigate the task difficulty by reducing the search space of
answers. Our method achieves new state-of-the-art performance on two medical
VQA benchmarks. In particular, for open-end questions, we achieve 79.19% on
VQA-RAD and 54.85% on PathVQA, with 16.09% and 41.45% absolute improvements,
respectively.
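
The answer-querying mechanism described in the abstract resembles a query-based decoder: one learnable embedding per candidate answer cross-attends to the fused image-question features, and each refined query is scored for whether its answer applies. Below is a minimal illustrative PyTorch sketch of that idea; the module layout, dimensions, fusion interface, and hyperparameters are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only: a decoder with learnable candidate-answer embeddings
# that cross-attend to fused image-question features and emit per-answer logits.
# Dimensions, fusion, and module layout are assumptions, not the paper's code.
import torch
import torch.nn as nn

class AnswerQueryingDecoder(nn.Module):
    def __init__(self, num_answers: int, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 3):
        super().__init__()
        # One learnable query embedding per candidate answer class.
        self.answer_queries = nn.Embedding(num_answers, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Each refined answer query is scored for the "existence" of its answer.
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, fused_feats: torch.Tensor) -> torch.Tensor:
        # fused_feats: (batch, seq_len, d_model) fused image-question features.
        batch = fused_feats.size(0)
        queries = self.answer_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        refined = self.decoder(tgt=queries, memory=fused_feats)
        return self.classifier(refined).squeeze(-1)  # (batch, num_answers) logits

# Usage: pick the highest-scoring candidate answer for each image-question pair.
decoder = AnswerQueryingDecoder(num_answers=500)
fused = torch.randn(2, 60, 512)        # stand-in for fused image-question features
pred = decoder(fused).argmax(dim=-1)   # predicted answer-class index per pair
```

A training objective over these logits (e.g., a classification loss over the candidate answer set) would complete the picture; it is omitted from this sketch.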
Related papers
- PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery [16.341966752582096]
This paper introduces PitVQA, a dataset specifically designed for Visual Question Answering (VQA) in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded text embedding for surgical VQA.
PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions.
PitVQA-Net consists of a novel image-grounded text embedding that projects image and text features into a shared embedding space, combined with a GPT2 backbone.
arXiv Detail & Related papers (2024-05-22T19:30:24Z) - XAIQA: Explainer-Based Data Augmentation for Extractive Question
Answering [1.1867812760085572]
We introduce a novel approach, XAIQA, for generating synthetic QA pairs at scale from data naturally available in electronic health records.
Our method uses the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes.
arXiv Detail & Related papers (2023-12-06T15:59:06Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - Medical Visual Question Answering: A Survey [55.53205317089564]
Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and popular VQA challenges.
Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer.
arXiv Detail & Related papers (2021-11-19T05:55:15Z) - Will this Question be Answered? Question Filtering via Answer Model
Distillation for Efficient Question Answering [99.66470885217623]
We propose a novel approach towards improving the efficiency of Question Answering (QA) systems by filtering out questions that will not be answered by them.
This is based on an interesting new finding: the answer confidence scores of state-of-the-art QA systems can be approximated well by models that use only the input question text.
arXiv Detail & Related papers (2021-09-14T23:07:49Z) - Hierarchical Deep Multi-modal Network for Medical Visual Question
Answering [25.633660028022195]
We propose a hierarchical deep multi-modal network that analyzes and classifies end-user questions/queries.
We integrate the QS model into the hierarchical deep multi-modal neural network to generate proper answers to queries about medical images.
arXiv Detail & Related papers (2020-09-27T07:24:41Z) - Multiple interaction learning with question-type prior knowledge for
constraining answer search space in visual question answering [24.395733613284534]
We propose a novel VQA model that utilizes question-type prior information to improve VQA.
Thorough experiments on two benchmark datasets, i.e., VQA 2.0 and TDIUC, indicate that the proposed method achieves the best performance compared with the most competitive approaches.
arXiv Detail & Related papers (2020-09-23T12:54:34Z) - Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex
Healthcare Question Answering [89.76059961309453]
The HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization exam.
These questions are the most challenging for current QA systems.
We present a Multi-step reasoning with Knowledge extraction framework (MurKe), which strives to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z) - C3VQG: Category Consistent Cyclic Visual Question Generation [51.339348810676896]
Visual Question Generation (VQG) is the task of generating natural questions based on an image.
In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers.
Our approach addresses two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision and (ii) it replaces generic questions with category-relevant generations.
arXiv Detail & Related papers (2020-05-15T20:25:03Z)