MuVAM: A Multi-View Attention-based Model for Medical Visual Question
Answering
- URL: http://arxiv.org/abs/2107.03216v1
- Date: Wed, 7 Jul 2021 13:40:25 GMT
- Title: MuVAM: A Multi-View Attention-based Model for Medical Visual Question
Answering
- Authors: Haiwei Pan, Shuning He, Kejia Zhang, Bo Qu, Chunling Chen, and Kun Shi
- Abstract summary: This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of text description.
Experiments on two datasets show that MuVAM surpasses state-of-the-art methods.
- Score: 2.413694065650786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (VQA) is a challenging multi-modal task
that has drawn wide attention from the computer vision and natural language
processing research communities. Since most current medical VQA models focus on
visual content and ignore the importance of text, this paper proposes a
multi-view attention-based model (MuVAM) for medical visual question answering
which integrates the high-level semantics of medical images on the basis of
text description. Firstly, different methods are utilized to extract the
features of the image and the question for the two modalities of vision and
text. Secondly, this paper proposes a multi-view attention mechanism that
includes Image-to-Question (I2Q) attention and Word-to-Text (W2T) attention.
Multi-view attention correlates the question with both the image and individual
words in order to better analyze the question and produce an accurate answer.
Thirdly, a composite loss is presented to predict the answer accurately after
multi-modal feature fusion and to improve the similarity between visual and
textual cross-modal features. It consists of a classification loss and an
image-question complementary (IQC) loss. Finally, to address data errors and
missing labels in the VQA-RAD dataset, we collaborate with medical experts to
correct and complete this dataset and then construct an enhanced dataset,
VQA-RADPh. Experiments on these two datasets show that MuVAM surpasses
state-of-the-art methods.
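The abstract describes the composite loss only at a high level: a classification loss over candidate answers plus an image-question complementary (IQC) term that increases cross-modal feature similarity. The sketch below is a hypothetical illustration of that structure, not the paper's exact formulation; the IQC term is implemented here as a simple cosine-alignment penalty, and all names, dimensions, and the weighting factor `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, answer_ids, img_feat, txt_feat, lam=0.5):
    """Hypothetical sketch of a composite VQA loss: classification
    loss plus an image-question complementary (IQC) term that pulls
    visual and textual features together. The IQC term here is a
    cosine-alignment penalty; the paper's exact form is not given
    in the abstract."""
    # Classification loss over the answer vocabulary.
    cls_loss = F.cross_entropy(logits, answer_ids)
    # IQC loss (assumed form): penalize low cross-modal similarity.
    iqc_loss = (1.0 - F.cosine_similarity(img_feat, txt_feat, dim=-1)).mean()
    return cls_loss + lam * iqc_loss

# Toy usage with random features.
torch.manual_seed(0)
logits = torch.randn(4, 100)            # batch of 4, 100 candidate answers
answers = torch.randint(0, 100, (4,))   # ground-truth answer indices
img = torch.randn(4, 512)               # fused visual features
txt = torch.randn(4, 512)               # fused textual features
loss = composite_loss(logits, answers, img, txt)
print(loss.item())
```

Both terms are differentiable, so the joint objective can be minimized end-to-end with any standard optimizer.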
Related papers
- Masked Vision and Language Pre-training with Unimodal and Multimodal
Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z) - RAMM: Retrieval-augmented Biomedical Visual Question Answering with
Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA)
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z) - Interpretable Medical Image Visual Question Answering via Multi-Modal
Relationship Graph Learning [45.746882253686856]
Medical visual question answering (VQA) aims to answer clinically relevant questions regarding input medical images.
We first collected a comprehensive and large-scale medical VQA dataset, focusing on chest X-ray images.
Based on this dataset, we also propose a novel baseline method by constructing three different relationship graphs.
arXiv Detail & Related papers (2023-02-19T17:46:16Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Self-supervised vision-language pretraining for Medical visual question
answering [9.073820229958054]
We propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all the three public medical VQA datasets.
arXiv Detail & Related papers (2022-11-24T13:31:56Z) - A Dual-Attention Learning Network with Word and Sentence Embedding for
Medical Visual Question Answering [2.0559497209595823]
Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis.
Existing MVQA question extraction schemes mainly focus on word information, ignoring medical information in the text.
In this study, a dual-attention learning network with word and sentence embedding (WSDAN) is proposed.
arXiv Detail & Related papers (2022-10-01T08:32:40Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA)
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media
Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
First, it builds a graph for the image and also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.