Fugu-MT Paper Translation (Abstract): Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Paper overview: Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

arxiv url: http://arxiv.org/abs/2510.08668v2
Date: Wed, 05 Nov 2025 15:19:13 GMT
Status: Translation complete
Last updated in system: 2025-11-06 16:07:39.990346
Title: Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Title (reference translation): Hulu-Med: A Transparent Generalist Model for Holistic Understanding
Authors: Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu
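Abstract summary: We introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM). Hulu-Med is trained on a curated corpus of 16.7 million samples spanning 12 anatomical systems and 14 medical imaging modalities. Hulu-Med surpasses existing open-source models on 27 of 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks.
Reference score (independently computed attention score): 112.46150793476603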
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-world clinical decision-making requires integrating heterogeneous data, including medical text, 2D images, 3D volumes, and videos, while existing AI systems fail to unify all these signals, limiting their utility. In this paper, we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM) designed to unify language-only, 2D/3D vision-language, and video understanding within a single architecture. Hulu-Med is trained on a curated corpus of 16.7 million samples, comprising exclusively public or synthetic data, spanning 12 major anatomical systems and 14 medical imaging modalities. Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant visual tokens, achieving up to a 55% reduction for 3D and video inputs, improving cross-modal efficiency, and enabling training at 7B-32B parameter scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and out-of-domain medical benchmarks (covering text reasoning, visual question answering, report generation, multilingual dialogue, video understanding, and rare disease diagnosis), Hulu-Med surpasses existing open-source models on 27 of 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1 on the text-only HealthBench. For the first time in the community, we provide a fully transparent, reproducible and cost-effective pipeline for holistic medical vision-language understanding by releasing our end-to-end data curation, training procedures, and model parameters. Code and models are available at https://github.com/ZJUI-AI4H/Hulu-Med.
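Abstract (reference translation): Real-world clinical decision-making requires integrating heterogeneous data, including medical text, 2D images, 3D volumes, and videos. In this paper, we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM) that unifies language-only, 2D/3D vision-language, and video understanding within a single architecture. Hulu-Med is trained on a curated corpus of 16.7 million samples, consisting exclusively of public or synthetic data, spanning 12 anatomical systems and 14 medical imaging modalities. Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant visual tokens, reducing them by up to 55% for 3D and video inputs, improving cross-modal efficiency, and enabling training at 7B-32B parameter scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and out-of-domain medical benchmarks (covering text reasoning, visual question answering, report generation, multilingual dialogue, video understanding, and rare disease diagnosis), Hulu-Med surpasses existing open-source models on 27 of the 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1 on the text-only HealthBench. For the first time in the community, we provide a fully transparent, reproducible, and cost-effective pipeline for holistic medical vision-language understanding by releasing our end-to-end data curation, training procedures, and model parameters. Code and models are available at https://github.com/ZJUI-AI4H/Hulu-Med.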


Related papers