A Dual-Attention Learning Network with Word and Sentence Embedding for
Medical Visual Question Answering
- URL: http://arxiv.org/abs/2210.00220v1
- Date: Sat, 1 Oct 2022 08:32:40 GMT
- Title: A Dual-Attention Learning Network with Word and Sentence Embedding for
Medical Visual Question Answering
- Authors: Xiaofei Huang, Hongfang Gong
- Abstract summary: Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis.
Existing MVQA question extraction schemes mainly focus on word information, ignoring medical information in the text.
In this study, a dual-attention learning network with word and sentence embedding (WSDAN) is proposed.
- Score: 2.0559497209595823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research in medical visual question answering (MVQA) can contribute to the
development of computer-aided diagnosis. MVQA is a task that aims to predict
accurate and convincing answers based on given medical images and associated
natural language questions. This task requires extracting medical
knowledge-rich feature content and making fine-grained understandings of them.
Therefore, constructing an effective feature extraction and understanding
scheme is key to modeling. Existing MVQA question extraction schemes mainly
focus on word information, ignoring medical information in the text. Meanwhile,
some visual and textual feature understanding schemes cannot effectively
capture the correlation between regions and keywords for reasonable visual
reasoning. In this study, a dual-attention learning network with word and
sentence embedding (WSDAN) is proposed. We design a module, transformer with
sentence embedding (TSE), to extract a double embedding representation of
questions containing keywords and medical information. A dual-attention learning
(DAL) module consisting of self-attention and guided attention is proposed to
model intensive intra-modal and inter-modal interactions. With multiple DAL
modules (DALs), learning visual and textual co-attention can increase the
granularity of understanding and improve visual reasoning. Experimental results
on the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate
that our proposed method outperforms previous state-of-the-art methods.
According to the ablation studies and Grad-CAM maps, WSDAN can extract rich
textual information and has strong visual reasoning ability.
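The abstract describes two components: a TSE module that yields a double embedding of the question (word-level plus sentence-level medical information) and a DAL module that applies self-attention within each modality followed by guided attention across modalities, with several DAL blocks stacked for deeper co-attention. The paper itself provides no code; the PyTorch sketch below is a minimal, hypothetical illustration of one DAL block under assumed dimensions and hyperparameters (d_model, n_heads, number of blocks), not the authors' implementation.
```python
# Minimal sketch of a dual-attention learning (DAL) block, assuming
# standard multi-head attention: intra-modal self-attention for text and
# image features, then guided (cross) attention from question features
# onto image-region features. All sizes are illustrative assumptions.
import torch
import torch.nn as nn


class DALBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Intra-modal self-attention for textual and visual features
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter-modal guided attention: image regions attend to question tokens
        self.guided_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_g = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # text:  (batch, n_tokens, d_model) -- e.g. word + sentence embeddings from TSE
        # image: (batch, n_regions, d_model) -- visual region features
        t, _ = self.text_self_attn(text, text, text)
        text = self.norm_t(text + t)
        v, _ = self.img_self_attn(image, image, image)
        image = self.norm_v(image + v)
        # Guided attention: textual features steer which image regions to focus on
        g, _ = self.guided_attn(image, text, text)
        image = self.norm_g(image + g)
        return text, image


# Stacking several DAL blocks (the paper's "multiple DALs") deepens the
# visual-textual co-attention; this loop is only a schematic usage example.
if __name__ == "__main__":
    blocks = nn.ModuleList([DALBlock() for _ in range(4)])
    text = torch.randn(2, 20, 512)   # dummy question features
    image = torch.randn(2, 36, 512)  # dummy region features
    for blk in blocks:
        text, image = blk(text, image)
    print(text.shape, image.shape)   # (2, 20, 512) (2, 36, 512)
```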
Related papers
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Object Attribute Matters in Visual Question Answering [15.705504296316576]
We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing.
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
arXiv Detail & Related papers (2023-12-20T12:46:30Z) - Masked Vision and Language Pre-training with Unimodal and Multimodal
Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z) - LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
Vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained with this approach, shows better image-text alignment in both CXR understanding and generation tasks.
arXiv Detail & Related papers (2023-05-19T07:44:39Z) - VQA with Cascade of Self- and Co-Attention Blocks [3.0013352260516744]
This work aims to learn an improved multi-modal representation through dense interaction of visual and textual modalities.
The proposed model has an attention block containing both self-attention and co-attention on image and text.
arXiv Detail & Related papers (2023-02-28T17:20:40Z) - Self-supervised vision-language pretraining for Medical visual question
answering [9.073820229958054]
We propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all three public medical VQA datasets.
arXiv Detail & Related papers (2022-11-24T13:31:56Z) - Align, Reason and Learn: Enhancing Medical Vision-and-Language
Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - MuVAM: A Multi-View Attention-based Model for Medical Visual Question
Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of the text description.
Experiments on two datasets show that MuVAM surpasses the state-of-the-art method.
arXiv Detail & Related papers (2021-07-07T13:40:25Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)