Med-Flamingo: a Multimodal Medical Few-shot Learner
- URL: http://arxiv.org/abs/2307.15189v1
- Date: Thu, 27 Jul 2023 20:36:02 GMT
- Title: Med-Flamingo: a Multimodal Medical Few-shot Learner
- Authors: Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka,
Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, Jure Leskovec
- Abstract summary: We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
- Score: 58.85676013818811
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Medicine, by its nature, is a multifaceted domain that requires the synthesis
of information across various modalities. Medical generative vision-language
models (VLMs) take a first step in this direction and promise many exciting
clinical applications. However, existing models typically have to be fine-tuned
on sizeable downstream datasets, which poses a significant limitation because
data is scarce in many medical applications, necessitating models that can
learn from few examples in real time. Here we propose Med-Flamingo, a
multimodal few-shot learner adapted to the medical domain. Based on
OpenFlamingo-9B, we continue pre-training on paired and interleaved medical
image-text data from publications and textbooks. Med-Flamingo unlocks few-shot
generative medical visual question answering (VQA) abilities, which we evaluate
on several datasets including a novel challenging open-ended VQA dataset of
visual USMLE-style problems. Furthermore, we conduct the first human evaluation
for generative medical VQA where physicians review the problems and blinded
generations in an interactive app. Med-Flamingo improves performance in
generative medical VQA by up to 20% in clinicians' ratings and is the first to
enable multimodal medical few-shot adaptations, such as rationale generation.
We release our model, code, and evaluation app at
https://github.com/snap-stanford/med-flamingo.
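To make the few-shot setup concrete, the sketch below assembles a Flamingo-style interleaved image-text prompt for generative medical VQA. It is a minimal illustration only: the `build_few_shot_prompt` helper, the file names, and the example questions and answers are hypothetical, while the `<image>` placeholder and `<|endofchunk|>` separator follow OpenFlamingo's interleaved prompt format.

```python
# Minimal, hypothetical sketch of a Flamingo-style few-shot prompt for
# generative medical VQA. Only the prompt string is built here; actual image
# tensors would be supplied to the model separately, in the same order.

def build_few_shot_prompt(demonstrations, query_question):
    """Interleave (image, question, answer) demonstrations with the query.

    Each demonstration contributes one "<image>" placeholder followed by its
    question/answer text and an "<|endofchunk|>" separator; the final segment
    leaves the answer open for the model to generate.
    """
    segments = []
    for _image_path, question, answer in demonstrations:
        segments.append(f"<image>Question: {question} Answer: {answer}<|endofchunk|>")
    segments.append(f"<image>Question: {query_question} Answer:")
    return "".join(segments)

# Hypothetical demonstration triples (image path, question, answer).
demos = [
    ("xray_case_1.png", "What abnormality is visible?", "Right lower lobe consolidation."),
    ("xray_case_2.png", "What abnormality is visible?", "No acute abnormality."),
]
print(build_few_shot_prompt(demos, "What abnormality is visible?"))
```

The image tensors would be passed to the model in the same order as the `<image>` placeholders, and the model completes the final answer in an open-ended fashion.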
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show that LoGra-Med matches LLaVA-Med's performance on 600K image-text pairs for medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging [68.6715007665896]
FedMedICL is a unified framework and benchmark to holistically evaluate federated medical imaging challenges.
We comprehensively evaluate several popular methods on six diverse medical imaging datasets.
We find that a simple batch balancing technique surpasses advanced methods in average performance across FedMedICL experiments.
arXiv Detail & Related papers (2024-07-11T19:12:23Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z)
- Visual Question Answering in the Medical Domain [13.673890873313354]
We present a novel contrastive learning pretraining method to mitigate the problem of small datasets for the Med-VQA task.
Our proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set, giving comparable results to other state-of-the-art Med-VQA models.
arXiv Detail & Related papers (2023-09-20T06:06:10Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks.
We propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z)
- Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.