Fugu-MT Paper Translation (Abstract): EVLF-FM: Explainable Vision Language Foundation Model for Medicine

Paper overview: EVLF-FM: Explainable Vision Language Foundation Model for Medicine

arxiv url: http://arxiv.org/abs/2509.24231v1
Date: Mon, 29 Sep 2025 03:15:57 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.718363
Title: EVLF-FM: Explainable Vision Language Foundation Model for Medicine
Title (reference translation): EVLF-FM: Explainable Medical Vision-Language Foundation Model
Authors: Yang Bai, Haoran Cheng, Yang Zhou, Jun Zhou, Arun Thirunavukarasu, Yuhe Ke, Jie Yao, Kanae Fukutsu, Chrystie Wan Ning Quek, Ashley Hong, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Hiok Hong Chan, Victor Koh, Marcus Tan, Kelvin Z. Li, Leonard Yip, Ching Yu Cheng, Yih Chung Tham, Gavin Siew Wei Tan, Leopold Schmetterer, Marcus Ang, Rahat Hussain, Jod Mehta, Tin Aung, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Soon Thye Lim, Eyal Klang, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting
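Abstract summary: This paper presents EVLF-FM, a multimodal vision-language foundation model. The development and testing of EVLF-FM encompassed over 1.3 million samples from 23 global datasets. In internal validation for disease diagnosis, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797). In medical visual grounding, EVLF-FM achieved stellar performance across nine modalities, with an average mIOU of 0.743 and Acc@0.5 of 0.837.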
Reference score (independently computed attention score): 26.787109735346103
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Despite the promise of foundation models in medical AI, current systems remain limited - they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grain explainability. The development and testing of EVLF-FM encompassed over 1.3 million total samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also achieved stellar performance across nine modalities with average mIOU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM model with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.
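Abstract (reference translation): Despite the promise of foundation models in medical AI, current systems remain limited: they are modality-specific and lack transparent reasoning processes, which hinders clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM). The development and testing of EVLF-FM drew on over 1.3 million samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation used 8,884 independent test samples from 10 datasets across five imaging modalities. EVLF-FM was developed to assist with multiple disease diagnosis and visual question answering, with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnosis, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797). In medical visual grounding, EVLF-FM achieved stellar performance across nine modalities, with an average mIOU of 0.743 and Acc@0.5 of 0.837. External validation confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning its outputs with visual evidence. EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.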
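Note on the grounding metrics: the abstract reports mIOU and Acc@0.5 without defining them. The following is a minimal sketch of how these two numbers are commonly computed in visual grounding work, not the authors' evaluation code. It assumes axis-aligned bounding boxes in a hypothetical `(x1, y1, x2, y2)` format, one predicted box per ground-truth box, and counts a prediction as correct when its IoU is at least 0.5 (some benchmarks use a strict inequality). The names `box_iou` and `grounding_metrics` are illustrative only.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, truth: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_truth = max(0.0, truth[2] - truth[0]) * max(0.0, truth[3] - truth[1])
    union = area_pred + area_truth - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Box], truths: Sequence[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mean IoU, fraction of samples with IoU >= threshold)."""
    if len(preds) != len(truths) or len(preds) == 0:
        raise ValueError("preds and truths must be non-empty and equal in length")
    ious = [box_iou(p, t) for p, t in zip(preds, truths)]
    mean_iou = sum(ious) / len(ious)
    acc_at_threshold = sum(iou >= threshold for iou in ious) / len(ious)
    return mean_iou, acc_at_threshold


if __name__ == "__main__":
    # Toy data, not taken from the paper: one exact match and one partial overlap.
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    miou, acc = grounding_metrics(predictions, ground_truth)
    print(f"mIoU={miou:.3f}  Acc@0.5={acc:.3f}")
```

Under this reading, an average mIOU of 0.743 describes the typical overlap between predicted and annotated regions, while Acc@0.5 of 0.837 is the share of samples whose overlap clears the 0.5 threshold. If the paper evaluates on segmentation masks rather than boxes, the same two aggregates would apply with a mask-based IoU in place of `box_iou`.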
