Medical visual question answering using joint self-supervised learning
- URL: http://arxiv.org/abs/2302.13069v1
- Date: Sat, 25 Feb 2023 12:12:22 GMT
- Title: Medical visual question answering using joint self-supervised learning
- Authors: Yuan Zhou, Jing Mei, Yiqin Yu, Tanveer Syeda-Mahmood
- Abstract summary: The encoder embeds across the image-text dual modalities with a self-attention mechanism.
The decoder is attached on top of the encoder and fine-tuned on the small-sized medical VQA dataset.
- Score: 8.817054025763325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) has become one of the most active research
problems in the medical imaging domain. A well-known VQA challenge is the
intrinsic diversity between the image and text modalities; in the medical
VQA task, a further critical problem is the limited size of
labelled image-question-answer data. In this study we propose an
encoder-decoder framework that leverages an image-text joint representation
learned from large-scale medical image-caption data and adapts it to the
small-sized medical VQA task. The encoder embeds across the image-text dual
modalities with a self-attention mechanism and is independently pre-trained on
a large-scale medical image-caption dataset via multiple self-supervised
learning tasks. The decoder is then attached on top of the encoder and
fine-tuned on the small-sized medical VQA dataset. The experimental results
show that our proposed method outperforms the
baseline and SOTA methods.
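The pipeline described in the abstract can be sketched as follows. This is an illustrative toy in PyTorch, not the authors' implementation: the module names, embedding sizes, patch-feature dimension, and mean-pooling answer head are all assumptions; the abstract only specifies a joint self-attention encoder over both modalities, pre-trained separately, with a decoder fine-tuned on top for VQA.

```python
# Sketch: shared self-attention encoder over concatenated image-patch and
# text-token embeddings, plus a small answer-classification decoder.
# All sizes and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    """Embeds image patches and question tokens into one sequence and
    applies self-attention jointly across both modalities."""
    def __init__(self, vocab_size=1000, num_patches=16, dim=64, depth=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_proj = nn.Linear(768, dim)  # assumed patch-feature size
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 32, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches, tokens):
        # Concatenate the two modalities into a single token sequence.
        x = torch.cat([self.patch_proj(patches),
                       self.text_embed(tokens)], dim=1)
        x = x + self.pos[:, : x.size(1)]
        return self.encoder(x)

class VQADecoder(nn.Module):
    """Answer head fine-tuned on the small VQA set (answers as classes)."""
    def __init__(self, dim=64, num_answers=100):
        super().__init__()
        self.head = nn.Linear(dim, num_answers)

    def forward(self, encoded):
        return self.head(encoded.mean(dim=1))  # pool, then classify

# Toy forward pass: 16 image patches + 8 question tokens -> answer logits.
enc, dec = JointEncoder(), VQADecoder()
patches = torch.randn(2, 16, 768)          # batch of 2 images
tokens = torch.randint(0, 1000, (2, 8))    # batch of 2 questions
logits = dec(enc(patches, tokens))
print(logits.shape)                        # torch.Size([2, 100])
```

In practice the encoder would first be pre-trained on the image-caption corpus (e.g. with masked-token and image-text matching objectives, as the self-supervised tasks suggest) before the decoder is attached and the whole model fine-tuned on the VQA data.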
Related papers
- MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z)
- Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering [45.058569118999436]
Given a pair of main and reference images, this task attempts to answer several questions on both diseases.
We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images.
arXiv Detail & Related papers (2023-07-22T05:34:18Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and ImageCLEF 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA)
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z)
- Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning [45.746882253686856]
Medical visual question answering (VQA) aims to answer clinically relevant questions regarding input medical images.
We first collected a comprehensive and large-scale medical VQA dataset, focusing on chest X-ray images.
Based on this dataset, we also propose a novel baseline method by constructing three different relationship graphs.
arXiv Detail & Related papers (2023-02-19T17:46:16Z)
- MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of text description.
Experiments on two datasets show that MuVAM surpasses state-of-the-art methods.
arXiv Detail & Related papers (2021-07-07T13:40:25Z)
- Generative Adversarial U-Net for Domain-free Medical Image Augmentation [49.72048151146307]
The shortage of annotated medical images is one of the biggest challenges in the field of medical image computing.
In this paper, we develop a novel generative method named generative adversarial U-Net.
Our newly designed model is domain-free and generalizable to various medical images.
arXiv Detail & Related papers (2021-01-12T23:02:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.