LLaVA-Med: Training a Large Language-and-Vision Assistant for
Biomedicine in One Day
- URL: http://arxiv.org/abs/2306.00890v1
- Date: Thu, 1 Jun 2023 16:50:07 GMT
- Authors: Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu,
Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao
- Abstract summary: We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s).
- Score: 85.19963303642427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational generative AI has demonstrated remarkable promise for
empowering biomedical practitioners, but current investigations focus on
unimodal text. Multimodal conversational AI has seen rapid progress by
leveraging billions of image-text pairs from the public web, but such
general-domain vision-language models still lack sophistication in
understanding and conversing about biomedical images. In this paper, we propose
a cost-efficient approach for training a vision-language conversational
assistant that can answer open-ended research questions about biomedical images.
The key idea is to leverage a large-scale, broad-coverage biomedical
figure-caption dataset extracted from PubMed Central, use GPT-4 to
self-instruct open-ended instruction-following data from the captions, and then
fine-tune a large general-domain vision-language model using a novel curriculum
learning method. Specifically, the model first learns to align biomedical
vocabulary using the figure-caption pairs as is, then learns to master
open-ended conversational semantics using GPT-4 generated instruction-following
data, broadly mimicking how a layperson gradually acquires biomedical
knowledge. This enables us to train a Large Language and Vision Assistant for
BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med
exhibits excellent multimodal conversational capability and can follow
open-ended instructions to assist with inquiries about a biomedical image. On
three standard biomedical visual question answering datasets, LLaVA-Med
outperforms previous supervised state-of-the-art on certain metrics. To
facilitate biomedical multimodal research, we will release our
instruction-following data and the LLaVA-Med model.
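
The data pipeline described above turns each PubMed Central figure caption into open-ended instruction-following conversations by prompting GPT-4. Below is a minimal, hypothetical sketch of how such a self-instruct request could be assembled; the prompt wording, the helper name, and the JSON answer schema are illustrative assumptions, not the prompts used in the paper.

# Hypothetical sketch: build a GPT-4 self-instruct request from a figure caption.
# The prompt text and output schema are assumptions, not the paper's released prompts.
import json

def build_selfinstruct_messages(caption: str, n_turns: int = 3) -> list:
    """Ask a GPT-4-class model to invent Q&A turns grounded only in the caption."""
    system = (
        "You are shown the caption of a biomedical figure. Write a conversation "
        "between a user who can see the image and an assistant. Produce "
        f"{n_turns} question-answer pairs as JSON: "
        '[{"question": "...", "answer": "..."}]. '
        "Use only information that can be inferred from the caption."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Caption: {caption}"},
    ]

messages = build_selfinstruct_messages(
    "Chest X-ray showing right lower lobe consolidation consistent with pneumonia."
)
print(json.dumps(messages, indent=2))

The returned message list can be sent to any chat-completion endpoint; the model's JSON reply then becomes one (image, conversation) training example.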
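
The curriculum itself has two stages: concept alignment on the raw figure-caption pairs, followed by instruction tuning on the GPT-4 generated conversations. The toy PyTorch sketch below illustrates that schedule; the module names, the choice to train only a projection layer in stage one, and the frozen vision encoder in stage two follow the LLaVA-style recipe the paper builds on and are assumptions here, not the released LLaVA-Med training code.

# Toy sketch of the two-stage curriculum; component choices are assumptions.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for a LLaVA-style model: vision encoder -> projector -> LM."""
    def __init__(self, img_dim=512, hid_dim=256, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(64, img_dim)   # stand-in for a CLIP-style encoder
        self.projector = nn.Linear(img_dim, hid_dim)   # maps image features into the LM space
        self.language_model = nn.GRU(hid_dim, hid_dim, batch_first=True)  # stand-in LM
        self.lm_head = nn.Linear(hid_dim, vocab)

    def forward(self, image, text_emb):
        img_tok = self.projector(self.vision_encoder(image)).unsqueeze(1)
        seq = torch.cat([img_tok, text_emb], dim=1)    # prepend the image token to the text
        out, _ = self.language_model(seq)
        return self.lm_head(out)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = ToyVLM()

# Stage 1: biomedical concept alignment on figure-caption pairs.
# Only the projector is updated; vision encoder and LM stay frozen (assumed).
set_trainable(model, False)
set_trainable(model.projector, True)
opt1 = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-3)

# Stage 2: open-ended instruction tuning on the GPT-4 generated conversations.
# Projector and LM are updated; the vision encoder remains frozen (assumed).
set_trainable(model, True)
set_trainable(model.vision_encoder, False)
opt2 = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-5)

In an actual run, each stage would loop over its own dataloader with a next-token prediction loss; the two separate optimizers here only make the curriculum schedule explicit.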
Related papers
- STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
- A Refer-and-Ground Multimodal Large Language Model for Biomedicine [10.519866875035003]
The Med-GRIT-270k dataset is the first dedicated to the biomedical domain and integrates refer and ground conversations.
We introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning.
arXiv Detail & Related papers (2024-06-26T07:56:17Z)
- Advancing High Resolution Vision-Language Models in Biomedicine [4.514292200785639]
We present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B.
We develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question answering benchmarks.
arXiv Detail & Related papers (2024-06-12T18:29:26Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights [15.952942443163474]
We propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences.
We demonstrate consistent and substantial performance improvements over the previous state of the art.
Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages.
arXiv Detail & Related papers (2023-11-27T18:46:17Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)