Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
- URL: http://arxiv.org/abs/2307.05314v1
- Date: Tue, 11 Jul 2023 15:00:11 GMT
- Title: Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
- Authors: Pengfei Li, Gang Liu, Jinlong He, Zixu Zhao and Shenjun Zhong
- Abstract summary: We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
- Score: 7.669872220702526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical visual question answering (VQA) is a challenging task that
requires answering clinical questions about a given medical image by taking
both visual and language information into account. However, because training
data for medical VQA is scarce, the pre-training and fine-tuning paradigm has
become a common way to improve model generalization. In this paper, we present
a novel self-supervised approach that learns unimodal and multimodal feature
representations of input images and text from medical image-caption datasets,
leveraging both unimodal and multimodal contrastive losses together with
masked language modeling and image-text matching as pre-training objectives.
The pre-trained model is then transferred to downstream medical VQA tasks. The
proposed approach achieves state-of-the-art (SOTA) performance on three
publicly available medical VQA datasets, with significant accuracy
improvements of 2.2%, 14.7%, and 1.7%, respectively. In addition, we conduct a
comprehensive analysis to validate the effectiveness of the different
components of the approach and study different pre-training settings. Our code
and models are available at https://github.com/pengfeiliHEU/MUMC.
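
To make the combined pre-training objective concrete, below is a minimal
PyTorch-style sketch of how unimodal and multimodal contrastive losses can be
summed with masked language modeling (MLM) and image-text matching (ITM)
losses. It is a sketch under stated assumptions, not the authors' released
implementation (see https://github.com/pengfeiliHEU/MUMC for that): the module
names (image_encoder, text_proj, mlm_head, itm_head), the batch keys, the use
of two augmented views per modality for the unimodal terms, and the equal
weighting of the four terms are all hypothetical.

```python
# Hypothetical sketch of a combined pre-training loss; not the MUMC code.
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def pretraining_loss(model, batch):
    # Unimodal features from two augmented views per modality (assumption:
    # the unimodal contrastive terms contrast views within the same modality).
    img1 = model.image_proj(model.image_encoder(batch["image_view1"]))
    img2 = model.image_proj(model.image_encoder(batch["image_view2"]))
    txt1 = model.text_proj(model.text_encoder(batch["caption_view1"]))
    txt2 = model.text_proj(model.text_encoder(batch["caption_view2"]))

    # Unimodal contrastive losses (image-image and text-text).
    loss_uni = info_nce(img1, img2) + info_nce(txt1, txt2)

    # Multimodal (image-text) contrastive loss.
    loss_itc = info_nce(img1, txt1)

    # Masked language modeling on the caption, conditioned on the image.
    loss_mlm = model.mlm_head(batch["masked_caption"], img1,
                              labels=batch["mlm_labels"])

    # Image-text matching: binary matched / mismatched classification.
    loss_itm = model.itm_head(img1, txt1, labels=batch["itm_labels"])

    # Equal weighting is an assumption; the terms may be weighted differently.
    return loss_uni + loss_itc + loss_mlm + loss_itm
```

In practice the four terms are often re-weighted and the contrastive
temperature tuned; the symmetric InfoNCE form above is a common choice, but
the paper's exact formulation may differ.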
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z)
- Self-supervised vision-language pretraining for Medical visual question answering [9.073820229958054]
We propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all three public medical VQA datasets.
arXiv Detail & Related papers (2022-11-24T13:31:56Z)
- MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of text description.
Experiments on two datasets show that MuVAM surpasses the state-of-the-art method.
arXiv Detail & Related papers (2021-07-07T13:40:25Z)
- MMBERT: Multimodal BERT Pretraining for Improved Medical VQA [23.78515287446131]
We propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks.
Our method involves learning richer medical image and text semantic representations using Masked Language Modeling.
The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images.
arXiv Detail & Related papers (2021-04-03T13:01:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.