Medical Vision Language Pretraining: A survey
- URL: http://arxiv.org/abs/2312.06224v1
- Date: Mon, 11 Dec 2023 09:14:13 GMT
- Title: Medical Vision Language Pretraining: A survey
- Authors: Prashant Shrestha, Sanskar Amgain, Bidur Khanal, Cristian A. Linte,
Binod Bhattarai
- Abstract summary: Medical Vision Language Pretraining is a promising solution to the scarcity of labeled data in the medical domain.
By leveraging paired/unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations.
- Score: 8.393439175704124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical Vision Language Pretraining (VLP) has recently emerged as a promising
solution to the scarcity of labeled data in the medical domain. By leveraging
paired/unpaired vision and text datasets through self-supervised learning,
models can be trained to acquire vast knowledge and learn robust feature
representations. Such pretrained models have the potential to enhance multiple
downstream medical tasks simultaneously, reducing the dependency on labeled
data. However, despite this recent progress and potential, no comprehensive
survey has yet explored the various aspects of and advancements in medical
VLP. In this paper, we specifically review existing
works through the lens of different pretraining objectives, architectures,
downstream evaluation tasks, and datasets utilized for pretraining and
downstream tasks. Subsequently, we delve into current challenges in medical
VLP, discussing existing and potential solutions, and conclude by highlighting
future directions. To the best of our knowledge, this is the first survey
focused on medical VLP.
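As a concrete illustration of one self-supervised objective commonly used in medical VLP, the sketch below shows a CLIP-style image-report contrastive step. It is a minimal, generic example: the encoders, dimensions, and temperature are placeholder assumptions, not the setup of any particular method covered by the survey.

```python
# Minimal sketch of an image-text contrastive (InfoNCE) pretraining step,
# one of the self-supervised objectives commonly used in medical VLP.
# image_encoder and text_encoder are assumed to return (B, D) embeddings.
import torch
import torch.nn.functional as F

def contrastive_step(image_encoder, text_encoder, images, reports, temperature=0.07):
    """One symmetric InfoNCE step over a batch of paired images and reports."""
    img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D)
    txt_emb = F.normalize(text_encoder(reports), dim=-1)   # (B, D)

    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(images.size(0), device=logits.device)

    # Match each image to its own report and each report to its own image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```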
Related papers
- STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review [0.0]
Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze medical data.
Our paper reviews recent advancements in developing models designed for medical report generation and visual question answering.
arXiv Detail & Related papers (2024-03-04T20:29:51Z)
- Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
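The masked vision-language pretraining entry above combines masked prediction with unimodal and multimodal contrastive losses. The sketch below illustrates that general recipe only; the tensors, loss terms, and weights are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch: one common way to combine a cross-modal contrastive
# term with a masked-token term in vision-language pretraining. The actual
# losses and weights used by the paper above may differ.
import torch
import torch.nn.functional as F

def combined_pretraining_loss(img_emb, txt_emb, mlm_logits, mlm_labels,
                              temperature=0.07, w_contrastive=1.0, w_mlm=1.0):
    # Cross-modal contrastive term over global image/text embeddings (B, D).
    i = F.normalize(img_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = i @ t.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Masked-token term: predict masked report tokens from multimodal logits
    # of shape (B, L, vocab); positions labelled -100 are ignored.
    mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                          mlm_labels.reshape(-1), ignore_index=-100)

    return w_contrastive * contrastive + w_mlm * mlm
```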
- Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark [12.565598914787834]
We propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs.
RGC can be used as a pre-training dataset or a new benchmark for medical report generation and medical image-text retrieval.
arXiv Detail & Related papers (2023-06-10T17:27:33Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
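The PMC-VQA entry above aligns features from a pre-trained vision encoder with a large language model. Below is a minimal sketch of that generic adapter pattern; the class name, dimensions, and token layout are hypothetical and are not the paper's implementation.

```python
# Hypothetical sketch of the generic "vision encoder -> projection -> LLM"
# pattern behind generative medical VQA models; module names and dimensions
# are illustrative assumptions, not the architecture of the paper above.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Learned projection from vision-encoder features to LLM token space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeddings):
        # patch_features: (B, N, vision_dim) from a frozen, pre-trained vision encoder
        # text_embeddings: (B, L, llm_dim) embedded question tokens
        visual_tokens = self.proj(patch_features)  # (B, N, llm_dim)
        # Prepend visual tokens so the language model attends to the image
        # while generating the answer.
        return torch.cat([visual_tokens, text_embeddings], dim=1)
```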
- Privacy-preserving machine learning for healthcare: open challenges and future perspectives [72.43506759789861]
We conduct a review of recent literature concerning Privacy-Preserving Machine Learning (PPML) for healthcare.
We primarily focus on privacy-preserving training and inference-as-a-service.
The aim of this review is to guide the development of private and efficient ML models in healthcare.
arXiv Detail & Related papers (2023-03-27T19:20:51Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
- VLP: A Survey on Vision-Language Pre-training [24.093731037295502]
The emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era.
This paper surveys recent advances and new frontiers in vision-language pre-training, including image-text and video-text pre-training.
arXiv Detail & Related papers (2022-02-18T07:54:02Z)
- Multilingual Medical Question Answering and Information Retrieval for Rural Health Intelligence Access [1.0499611180329804]
In rural regions of several developing countries, access to quality healthcare, medical infrastructure, and professional diagnosis is largely unavailable.
Many deaths resulting from this lack of medical access, the absence of patients' previous health records, and the supplanting of information in indigenous languages could easily be prevented.
We describe an approach that leverages the rapid progress in Machine Learning and Natural Language Processing (NLP) techniques to design a low-resource, multilingual model that serves as a preliminary first-point-of-contact medical assistant.
arXiv Detail & Related papers (2021-06-02T16:05:24Z)