Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
- URL: http://arxiv.org/abs/2510.08668v1
- Date: Thu, 09 Oct 2025 17:06:42 GMT
- Title: Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
- Authors: Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu
- Abstract summary: We present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension.
- Score: 112.46150793476603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. Medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for the 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks shows state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare-disease scenarios. By open-sourcing our complete pipeline, we establish that high-performance medical VLMs can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released at https://github.com/ZJUI-AI4H/Hulu-Med.
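The abstract names two architectural ingredients: a unified patch-based vision encoder feeding an LLM decoder, and medical-aware token reduction to keep training tractable. The exact reduction criterion is not given in the abstract; the sketch below is a minimal, hypothetical stand-in that drops the most redundant patch tokens by cosine similarity, one common way such reduction is implemented.

```python
# A minimal, hypothetical sketch of similarity-based visual token reduction.
# The actual "medical-aware" criterion in Hulu-Med is not described in the
# abstract; the cosine-redundancy pruning below is an assumed stand-in.
import torch
import torch.nn.functional as F

def reduce_tokens(patch_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the least redundant patch tokens.

    patch_tokens: (num_patches, dim) embeddings from a patch-based encoder.
    keep_ratio:   fraction of tokens forwarded to the LLM decoder.
    """
    normed = F.normalize(patch_tokens, dim=-1)
    sim = normed @ normed.T
    sim.fill_diagonal_(-1.0)                 # ignore self-similarity
    redundancy = sim.max(dim=-1).values      # high = near-duplicate of another token

    k = max(1, int(patch_tokens.shape[0] * keep_ratio))
    keep = redundancy.topk(k, largest=False).indices.sort().values
    return patch_tokens[keep]                # order-preserving subset

# Example: 2,048 patch tokens from a 3D volume, halved before the decoder.
tokens = torch.randn(2048, 768)
print(reduce_tokens(tokens).shape)           # torch.Size([1024, 768])
```

Whatever the paper's exact criterion, shrinking the visual token sequence directly cuts decoder compute, which is consistent with the modest GPU-hour budgets the abstract reports for 3D and video inputs.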
Related papers
- MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images [25.29568841502814]
We introduce MedMO, a medical foundation model built upon a generalized MLLM architecture. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy.
arXiv Detail & Related papers (2026-02-06T18:59:59Z) - DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis [5.494301428436596]
We introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis. It is a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task. For segmentation, DuPLUS generalizes across three imaging modalities and ten medical datasets, encompassing more than 30 organs and tumor types.
arXiv Detail & Related papers (2025-10-03T20:01:00Z) - MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos [16.86256309424395]
We introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models.
arXiv Detail & Related papers (2025-07-08T04:58:36Z) - MedGemma Technical Report [75.88152277443179]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z) - Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z) - OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding [35.35197484810533]
We present OmniV-Med, a unified framework for multimodal medical understanding. We devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture. We introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data and medical videos.
arXiv Detail & Related papers (2025-04-20T17:53:56Z) - HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale [29.956053068653734]
We create the PubMedVision dataset with 1.3 million medical VQA samples.
Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios.
arXiv Detail & Related papers (2024-06-27T15:50:41Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf, general-purpose, large-scale pre-trained models, i.e., foundation models (FMs) from computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
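The LLaVA-Med entry above describes a two-stage recipe: first aligning biomedical vocabulary on figure-caption pairs, then tuning on open-ended conversations. One common way to realize such a curriculum, assumed here rather than taken from the authors' released code, is to train only the vision-to-language projector in stage one and unfreeze the language model in stage two:

```python
# A hypothetical two-stage freeze/unfreeze curriculum; module names, sizes,
# and optimizers are illustrative, not LLaVA-Med's released training code.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vis_dim=64, llm_dim=128, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # stand-in for a ViT
        self.projector = nn.Linear(vis_dim, llm_dim)       # vision-to-LLM bridge
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image_feats, text_embeds):
        vis = self.projector(self.vision_encoder(image_feats))
        hidden = self.llm(torch.cat([vis, text_embeds], dim=1))
        return self.lm_head(hidden)

model = TinyVLM()

# Stage 1: vocabulary/concept alignment on figure-caption pairs.
# Freeze everything except the projector, as in LLaVA-style recipes.
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True
stage1_opt = torch.optim.AdamW(model.projector.parameters(), lr=1e-3)

# Stage 2: open-ended conversational tuning; unfreeze the language model too.
for p in model.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
```

Training only the small projector first keeps stage one cheap, so most of the reported compute budget would go to the conversational stage.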
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.