MMBERT: Multimodal BERT Pretraining for Improved Medical VQA
- URL: http://arxiv.org/abs/2104.01394v1
- Date: Sat, 3 Apr 2021 13:01:19 GMT
- Title: MMBERT: Multimodal BERT Pretraining for Improved Medical VQA
- Authors: Yash Khare, Viraj Bagal, Minesh Mathew, Adithi Devi, U Deva
Priyakumar, CV Jawahar
- Abstract summary: We propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks.
Our method involves learning richer medical image and text semantic representations using Masked Language Modeling.
The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images.
- Score: 23.78515287446131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Images in the medical domain are fundamentally different from
general-domain images. Consequently, it is infeasible to directly employ
general-domain Visual Question Answering (VQA) models for the medical domain.
Additionally, medical image annotation is a costly and time-consuming process.
To overcome
these limitations, we propose a solution inspired by self-supervised
pretraining of Transformer-style architectures for NLP, Vision and Language
tasks. Our method involves learning richer medical image and text semantic
representations using Masked Language Modeling (MLM) with image features as the
pretext task on a large medical image+caption dataset. The proposed solution
achieves new state-of-the-art performance on two VQA datasets for radiology
images -- VQA-Med 2019 and VQA-RAD, outperforming even the ensemble models of
previous best solutions. Moreover, our solution provides attention maps which
help in model interpretability. The code is available at
https://github.com/VirajBagal/MMBERT
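The pretext task described above, Masked Language Modeling over a joint sequence of image features and caption tokens, can be sketched in miniature. The snippet below is an illustrative assumption, not the authors' implementation: token ids, the `MASK_ID`/`IGNORE` constants, and the helper name are hypothetical, and real image features would be region embeddings rather than placeholder ids. It only shows the core idea that caption positions are randomly masked for prediction while image-feature positions are left intact.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id
IGNORE = -100   # label value skipped by the MLM loss

def mask_caption_tokens(token_ids, num_image_tokens, mask_prob=0.15, seed=0):
    """Prepare one MLM training example for an image+caption sequence.

    The first `num_image_tokens` positions stand in for image-region
    features and are never masked; only caption positions are candidates.
    Returns (masked_ids, labels): labels are IGNORE everywhere except at
    masked positions, which keep the original token id as the target.
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i in range(num_image_tokens, len(token_ids)):
        if rng.random() < mask_prob:
            labels[i] = masked[i]   # remember what the model must predict
            masked[i] = MASK_ID     # hide it from the input
    return masked, labels

# Example: 3 image-feature slots followed by 8 caption token ids.
ids = [1, 1, 1, 2054, 2003, 1996, 7071, 1999, 1996, 2217, 1012]
masked, labels = mask_caption_tokens(ids, num_image_tokens=3, mask_prob=0.3)
```

A Transformer trained this way must use the unmasked image features to recover the hidden caption words, which is what pushes it toward the richer joint image-text representations the abstract credits for the VQA gains.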
Related papers
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z)
- Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [35.64805788623848]
We focus on the problem of Medical Visual Question Answering (MedVQA).
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Self-supervised vision-language pretraining for Medical visual question answering [9.073820229958054]
We propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all the three public medical VQA datasets.
arXiv Detail & Related papers (2022-11-24T13:31:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.