Related papers: HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

URL: http://arxiv.org/abs/2406.19280v4
Date: Mon, 30 Sep 2024 06:45:16 GMT
Title: HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Authors: Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang,
Abstract summary: We create the PubMedVision dataset with 1.3 million medical VQA samples. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios.
Score: 29.956053068653734
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

Related papers

Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs [7.542510160217106]
We propose Fleming-VL, a framework for comprehensive medical visual understanding across heterogeneous modalities.<n>Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding.
arXiv Detail & Related papers (2025-11-02T12:30:22Z)
InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning [19.791150694039466]
InfiMed-Foundation-1.7B and InfiMed-Foundation-4B are two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications.<n>We employ low-to-high image resolution and multimodal sequence packing to enhance training efficiency.<n>InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT.
arXiv Detail & Related papers (2025-09-26T12:26:16Z)
MedGemma Technical Report [75.88152277443179]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B.<n>MedGemma demonstrates advanced medical understanding and reasoning on images and text.<n>We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z)
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge.<n>We then introduce our medical-specialized MLLM: Lingshu.<n>Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning [19.791150694039466]
We introduce our InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, both of which deliver state-of-the-art performance across seven multimodal medical benchmarks.<n>InfiMed-RL-3B achieves an average accuracy of 59.2%, outperforming even larger models like InternVL3-8B, which achieves 57.3%.
arXiv Detail & Related papers (2025-05-29T10:31:57Z)
Medical Large Vision Language Models with Multi-Image Visual Ability [46.889345205047675]
We present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs.<n>We fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis.<n>We also develop the Med-MIM benchmark to evaluate the medical multi-image understanding capabilities of LVLMs.
arXiv Detail & Related papers (2025-05-25T08:31:22Z)
MEDMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph [28.79000907242469]
We propose MEDMKG, a Medical Multimodal Knowledge Graph that unifies visual and textual medical information through a multi-stage construction pipeline.<n>We evaluate MEDMKG across three tasks under two experimental settings, benchmarking twenty-four baseline methods and four state-of-the-art vision-language backbones on six datasets.<n>Results show that MEDMKG not only improves performance in downstream medical tasks but also offers a strong foundation for developing adaptive and robust strategies for multimodal knowledge integration in medical artificial intelligence.
arXiv Detail & Related papers (2025-05-22T18:41:46Z)
Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback [57.98393950821579]
We propose a novel UMed-LVLM designed to unveil medical abnormalities.<n>We propose a prompt method utilizing the GPT-4V to generate diagnoses based on identified abnormal areas in medical images.<n>Our UMed-LVLM significantly outperforms existing Med-LVLMs in identifying and understanding medical abnormalities.
arXiv Detail & Related papers (2025-01-02T17:37:20Z)
UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs. We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
arXiv Detail & Related papers (2024-12-13T18:59:40Z)
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI [34.80116091045628]
We develop GMAI-VL, a general medical vision-language model, with a three-stage training strategy. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks.
arXiv Detail & Related papers (2024-11-21T18:59:36Z)
A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z)
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding [9.144030136201476]
Multimodal large language models (MLLMs) inherit the superior text understanding capabilities of LLMs and extend these capabilities to multimodal scenarios. These models achieve excellent results in the general domain of multimodal tasks. However, in the medical domain, the substantial training costs and the requirement for extensive medical data pose challenges to the development of medical MLLMs.
arXiv Detail & Related papers (2024-10-31T11:07:26Z)
LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound [7.941670191244354]
We propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning. Specifically, we devise a fusion module with fine-grained vision encoders to achieve enhancement for subtle medical visual semantics. For execution, we leverage a large-scale multimodal Chinese ultrasound dataset obtained from the hospital.
arXiv Detail & Related papers (2024-10-19T11:38:31Z)
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models [20.781551849965357]
We introduce MediConfusion, a challenging medical Visual Question Answering (VQA) benchmark dataset. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct for medical experts. We also extract common patterns of model failure that may help the design of a new generation of more trustworthy and reliable MLLMs in healthcare.
arXiv Detail & Related papers (2024-09-23T18:59:37Z)
STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
Capabilities of Gemini Models in Medicine [100.60391771032887]
We introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them. Our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment.
arXiv Detail & Related papers (2024-04-29T04:11:28Z)
RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning [14.366349078707263]
RJUA-MedDQA is a comprehensive benchmark in the field of medical specialization. This work introduces RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization.
arXiv Detail & Related papers (2024-02-19T06:57:02Z)
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark. All images in this benchmark are sourced from authentic medical scenarios. We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z)
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks. We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.