Self-supervised vision-language pretraining for Medical visual question
answering
- URL: http://arxiv.org/abs/2211.13594v1
- Date: Thu, 24 Nov 2022 13:31:56 GMT
- Title: Self-supervised vision-language pretraining for Medical visual question
answering
- Authors: Pengfei Li, Gang Liu, Lin Tan, Jinying Liao and Shenjun Zhong
- Abstract summary: We propose a self-supervised method that applies masked image modeling, masked language modeling, image-text matching, and image-text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all three public medical VQA datasets.
- Score: 9.073820229958054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical image visual question answering (VQA) is the task of answering clinical
questions about a given radiographic image; it is a challenging problem that requires a
model to integrate both vision and language information. To solve medical VQA problems
with a limited amount of training data, the pretrain-finetune paradigm is widely used to
improve model generalization. In this paper, we propose a self-supervised method that
applies masked image modeling, masked language modeling, image-text matching, and
image-text alignment via contrastive learning (M2I2) for pretraining on a medical
image-caption dataset, and then fine-tunes on downstream medical VQA tasks. The proposed
method achieves state-of-the-art performance on all three public medical VQA datasets.
Our code and models are available at https://github.com/pengfeiliHEU/M2I2.
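The four pretraining objectives named in the abstract can be combined into a single self-supervised loss. The sketch below is a minimal, illustrative PyTorch rendering of that idea; the toy encoders, dimensions, masking ratios, and the concatenation-based ITM head are assumptions made for brevity, not the authors' released implementation (see the linked GitHub repository for that).

```python
# Minimal sketch of M2I2-style multi-objective pretraining (MIM + MLM + ITM + ITC).
# All sizes and masking ratios are illustrative assumptions, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, PATCHES, PATCH_PIX, VOCAB, SEQ, MASK_ID = 256, 49, 768, 30522, 32, 103


class ToyM2I2(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(PATCH_PIX, DIM)              # toy image encoder
        self.img_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, 4, batch_first=True), 2)
        self.tok_embed = nn.Embedding(VOCAB, DIM)                  # toy text encoder
        self.txt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, 4, batch_first=True), 2)
        self.mlm_head = nn.Linear(DIM, VOCAB)                      # masked-token prediction
        self.mim_head = nn.Linear(DIM, PATCH_PIX)                  # masked-patch reconstruction
        self.itm_head = nn.Linear(2 * DIM, 2)                      # matched vs. mismatched pair
        self.temp = nn.Parameter(torch.tensor(0.07))               # contrastive temperature

    def forward(self, patches, tokens):
        B = patches.size(0)
        # Randomly mask image patches and text tokens for MIM / MLM.
        img_mask = torch.rand(B, PATCHES, device=patches.device) < 0.5
        txt_mask = torch.rand(B, SEQ, device=tokens.device) < 0.15
        masked_patches = patches.masked_fill(img_mask.unsqueeze(-1), 0.0)
        masked_tokens = tokens.masked_fill(txt_mask, MASK_ID)

        img_feat = self.img_encoder(self.patch_embed(masked_patches))
        txt_feat = self.txt_encoder(self.tok_embed(masked_tokens))
        img_cls = F.normalize(img_feat.mean(1), dim=-1)            # pooled global features
        txt_cls = F.normalize(txt_feat.mean(1), dim=-1)

        # Image-text contrastive alignment (ITC): symmetric InfoNCE over the batch.
        logits = img_cls @ txt_cls.t() / self.temp
        targets = torch.arange(B, device=logits.device)
        loss_itc = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.t(), targets)) / 2

        # Image-text matching (ITM): positives vs. rolled (mismatched) captions.
        neg_txt = txt_cls.roll(1, 0)
        pairs = torch.cat([torch.cat([img_cls, txt_cls], -1),
                           torch.cat([img_cls, neg_txt], -1)], 0)
        itm_labels = torch.cat([torch.ones(B, dtype=torch.long, device=pairs.device),
                                torch.zeros(B, dtype=torch.long, device=pairs.device)])
        loss_itm = F.cross_entropy(self.itm_head(pairs), itm_labels)

        # Masked language modeling (MLM): predict original tokens at masked positions.
        loss_mlm = F.cross_entropy(self.mlm_head(txt_feat)[txt_mask], tokens[txt_mask])

        # Masked image modeling (MIM): reconstruct pixels of masked patches.
        loss_mim = F.mse_loss(self.mim_head(img_feat)[img_mask], patches[img_mask])

        return loss_itc + loss_itm + loss_mlm + loss_mim


model = ToyM2I2()
patches = torch.randn(4, PATCHES, PATCH_PIX)                       # fake image patches
tokens = torch.randint(0, VOCAB, (4, SEQ))                         # fake caption tokens
loss = model(patches, tokens)
loss.backward()
print(f"combined pretraining loss: {loss.item():.3f}")
```

In the paper's setting the image and text streams are additionally fused by a cross-modal encoder and ITM negatives are typically mined from the contrastive similarity scores; the concatenation-based ITM head and rolled negatives above are simplifications kept to keep the sketch short.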
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med's performance when trained on 600K image-text pairs for medical VQA and significantly outperforms it when trained on only 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with its designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z) - Masked Vision and Language Pre-training with Unimodal and Multimodal
Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z) - Localized Questions in Medical Visual Question Answering [2.005299372367689]
Visual Question Answering (VQA) models aim to answer natural language questions about given images.
Existing medical VQA models typically focus on answering questions that refer to an entire image.
This paper proposes a novel approach for medical VQA that addresses this limitation by developing a model that can answer questions about image regions.
arXiv Detail & Related papers (2023-07-03T14:47:18Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - RAMM: Retrieval-augmented Biomedical Visual Question Answering with
Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - MuVAM: A Multi-View Attention-based Model for Medical Visual Question
Answering [2.413694065650786]
This paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering.
It integrates the high-level semantics of medical images on the basis of text description.
Experiments on two datasets show that MuVAM surpasses the state-of-the-art method.
arXiv Detail & Related papers (2021-07-07T13:40:25Z) - MMBERT: Multimodal BERT Pretraining for Improved Medical VQA [23.78515287446131]
We propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks.
Our method involves learning richer medical image and text semantic representations using Masked Language Modeling.
The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images.
arXiv Detail & Related papers (2021-04-03T13:01:19Z)