MMBERT: Multimodal BERT Pretraining for Improved Medical VQA
- URL: http://arxiv.org/abs/2104.01394v1
- Date: Sat, 3 Apr 2021 13:01:19 GMT
- Title: MMBERT: Multimodal BERT Pretraining for Improved Medical VQA
- Authors: Yash Khare, Viraj Bagal, Minesh Mathew, Adithi Devi, U Deva
Priyakumar, CV Jawahar
- Abstract summary: We propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks.
Our method involves learning richer medical image and text semantic representations using Masked Language Modeling.
The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images.
- Score: 23.78515287446131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Images in the medical domain are fundamentally different from
general-domain images. Consequently, it is infeasible to directly employ
general-domain Visual Question Answering (VQA) models for the medical domain.
Additionally, medical image annotation is a costly and time-consuming process.
To overcome
these limitations, we propose a solution inspired by self-supervised
pretraining of Transformer-style architectures for NLP, Vision and Language
tasks. Our method involves learning richer medical image and text semantic
representations using Masked Language Modeling (MLM) with image features as the
pretext task on a large medical image+caption dataset. The proposed solution
achieves new state-of-the-art performance on two VQA datasets for radiology
images -- VQA-Med 2019 and VQA-RAD, outperforming even the ensemble models of
previous best solutions. Moreover, our solution provides attention maps which
help in model interpretability. The code is available at
https://github.com/VirajBagal/MMBERT
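The pretext task described above, Masked Language Modeling over a joint sequence of image features and caption tokens, can be sketched in miniature. The snippet below is an illustrative assumption, not the authors' implementation: token ids, the `MASK_ID`/`IGNORE` constants, and the helper name are hypothetical, and real image features would be region embeddings rather than placeholder ids. It only shows the core idea that caption positions are randomly masked for prediction while image-feature positions are left intact.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id
IGNORE = -100   # label value skipped by the MLM loss

def mask_caption_tokens(token_ids, num_image_tokens, mask_prob=0.15, seed=0):
    """Prepare one MLM training example for an image+caption sequence.

    The first `num_image_tokens` positions stand in for image-region
    features and are never masked; only caption positions are candidates.
    Returns (masked_ids, labels): labels are IGNORE everywhere except at
    masked positions, which keep the original token id as the target.
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i in range(num_image_tokens, len(token_ids)):
        if rng.random() < mask_prob:
            labels[i] = masked[i]   # remember what the model must predict
            masked[i] = MASK_ID     # hide it from the input
    return masked, labels

# Example: 3 image-feature slots followed by 8 caption token ids.
ids = [1, 1, 1, 2054, 2003, 1996, 7071, 1999, 1996, 2217, 1012]
masked, labels = mask_caption_tokens(ids, num_image_tokens=3, mask_prob=0.3)
```

A Transformer trained this way must use the unmasked image features to recover the hidden caption words, which is what pushes it toward the richer joint image-text representations the abstract credits for the VQA gains.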
Related papers
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z)
- Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [35.64805788623848]
We focus on the problem of Medical Visual Question Answering (MedVQA).
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Self-supervised vision-language pretraining for Medical visual question answering [9.073820229958054]
We propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all the three public medical VQA datasets.
arXiv Detail & Related papers (2022-11-24T13:31:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.