MuVAM: A Multi-View Attention-based Model for Medical Visual Question
Answering
- URL: http://arxiv.org/abs/2107.03216v1
- Date: Wed, 7 Jul 2021 13:40:25 GMT
- Title: MuVAM: A Multi-View Attention-based Model for Medical Visual Question
Answering
- Authors: Haiwei Pan, Shuning He, Kejia Zhang, Bo Qu, Chunling Chen, and Kun Shi
- Abstract summary: This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of text description.
Experiments on two datasets show that MuVAM surpasses the state-of-the-art method.
- Score: 2.413694065650786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (VQA) is a challenging multi-modal
task that has received wide attention from the computer vision and natural
language processing research communities. Since most current medical VQA models
focus on visual content and ignore the importance of text, this paper proposes
a multi-view attention-based model (MuVAM) for medical visual question
answering which integrates the high-level semantics of medical images on the
basis of the text description. Firstly, different methods are used to extract
the features of the image and the question for the two modalities of vision
and text. Secondly, this paper proposes a multi-view attention mechanism that
includes Image-to-Question (I2Q) attention and Word-to-Text (W2T) attention.
Multi-view attention correlates the question with both the image and individual
words so that the question can be analyzed more thoroughly and an accurate
answer obtained. Thirdly, a composite loss is presented to predict the answer
accurately after multi-modal feature fusion and to improve the similarity
between visual and textual cross-modal features. It consists of a
classification loss and an image-question complementary (IQC) loss. Finally, to
address data errors and missing labels in the VQA-RAD dataset, we collaborated
with medical experts to correct and complete the dataset and constructed an
enhanced dataset, VQA-RADPh. Experiments on these two datasets show that MuVAM
surpasses the state-of-the-art method.
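The abstract describes the pipeline only at a high level, so the following is a
minimal PyTorch sketch of how the I2Q attention and the composite loss could fit
together. It is an illustration under stated assumptions, not the authors'
implementation: the exact I2Q/W2T attention formulas and the IQC loss are not
given in the abstract, so scaled dot-product attention and a cosine-similarity
penalty are used as stand-ins, and all class names, dimensions, and the
weighting factor `lambda_iqc` are hypothetical.

```python
# Illustrative sketch only; the W2T attention branch is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class I2QAttention(nn.Module):
    """Image-to-Question attention: the question vector attends over image regions."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, question_vec, image_regions):
        # question_vec: (B, D), image_regions: (B, R, D)
        q = self.query(question_vec).unsqueeze(1)           # (B, 1, D)
        k = self.key(image_regions)                          # (B, R, D)
        v = self.value(image_regions)                        # (B, R, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                          # (B, D) attended visual feature


class CompositeLoss(nn.Module):
    """Classification loss plus an assumed image-question complementary (IQC) term."""

    def __init__(self, lambda_iqc: float = 0.5):              # weighting factor is assumed
        super().__init__()
        self.lambda_iqc = lambda_iqc
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, labels, visual_feat, text_feat):
        cls_loss = self.ce(logits, labels)
        # Assumed IQC form: pull the visual and textual features of a pair together.
        iqc_loss = 1.0 - F.cosine_similarity(visual_feat, text_feat, dim=-1).mean()
        return cls_loss + self.lambda_iqc * iqc_loss


if __name__ == "__main__":
    B, R, D, n_answers = 4, 36, 512, 100                     # arbitrary sizes
    i2q = I2QAttention(D)
    classifier = nn.Linear(2 * D, n_answers)                  # fuse by concatenation
    criterion = CompositeLoss()

    question = torch.randn(B, D)                               # stand-in text encoder output
    regions = torch.randn(B, R, D)                             # stand-in image encoder output
    labels = torch.randint(0, n_answers, (B,))

    visual = i2q(question, regions)
    logits = classifier(torch.cat([visual, question], dim=-1))
    loss = criterion(logits, labels, visual, question)
    print(loss.item())
```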
Related papers
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - Masked Vision and Language Pre-training with Unimodal and Multimodal
Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - RAMM: Retrieval-augmented Biomedical Visual Question Answering with
Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z) - Interpretable Medical Image Visual Question Answering via Multi-Modal
Relationship Graph Learning [45.746882253686856]
Medical visual question answering (VQA) aims to answer clinically relevant questions regarding input medical images.
We first collected a comprehensive and large-scale medical VQA dataset, focusing on chest X-ray images.
Based on this dataset, we also propose a novel baseline method by constructing three different relationship graphs.
arXiv Detail & Related papers (2023-02-19T17:46:16Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Self-supervised vision-language pretraining for Medical visual question
answering [9.073820229958054]
We propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all three public medical VQA datasets.
arXiv Detail & Related papers (2022-11-24T13:31:56Z) - A Dual-Attention Learning Network with Word and Sentence Embedding for
Medical Visual Question Answering [2.0559497209595823]
Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis.
Existing MVQA question extraction schemes mainly focus on word information, ignoring medical information in the text.
In this study, a dual-attention learning network with word and sentence embedding (WSDAN) is proposed.
arXiv Detail & Related papers (2022-10-01T08:32:40Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
First, it not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)