Fugu-MT paper translation (abstract): Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

Paper summary: Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

arxiv url: http://arxiv.org/abs/2605.04098v1
Date: Fri, 01 May 2026 02:54:07 GMT
Status: translation complete
Last updated in system: 2026-05-07 18:41:07.431767
Title: Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology
Authors: Roy Jiang, Hyunjae Kim, Zhenyue Qin, Morten Lee, Margaret MacGibeny, Ailish Hanly, Angela Sadlowski, Shanin Chowdhury, Xuguang Ai, Jeffrey Gehlhausen, Qingyu Chen
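Abstract summary: Multimodal large language models (MLLMs) have shown promise on publicly available dermatology benchmarks. Four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) were compared on three public dermatology datasets and a real-world hospital consultation cohort. Diagnostic performance was modest on the public datasets and declined substantially in the real-world cohort.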
Reference score (attention score computed independently by this site): 6.096816682256677
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
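The two headline metrics in the abstract, top-3 diagnostic accuracy and triage sensitivity, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy cases, and in particular the matching rule (case-insensitive exact string match). The abstract does not say how a predicted diagnosis was judged to match the reference diagnosis, so this is not the authors' evaluation code.

```python
def top_k_accuracy(ranked_predictions, true_diagnoses, k=3):
    """Fraction of cases whose true diagnosis appears among the first k predictions.

    Matching is a case-insensitive exact string comparison, which is an
    assumption of this sketch, not the paper's stated criterion.
    """
    if len(ranked_predictions) != len(true_diagnoses):
        raise ValueError("predictions and ground truth must have the same length")
    if not true_diagnoses:
        raise ValueError("at least one case is required")
    hits = 0
    for ranked, truth in zip(ranked_predictions, true_diagnoses):
        top_k = [p.strip().lower() for p in ranked[:k]]
        if truth.strip().lower() in top_k:
            hits += 1
    return hits / len(true_diagnoses)


def sensitivity(predicted_positive, actually_positive):
    """True positives divided by all actual positives (TP / (TP + FN))."""
    pairs = list(zip(predicted_positive, actually_positive))
    true_pos = sum(1 for pred, actual in pairs if pred and actual)
    false_neg = sum(1 for pred, actual in pairs if not pred and actual)
    if true_pos + false_neg == 0:
        raise ValueError("no actual positive cases; sensitivity is undefined")
    return true_pos / (true_pos + false_neg)


# Hypothetical toy data, not taken from the paper.
predictions = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "nevus", "seborrheic keratosis"],
    ["acne", "rosacea", "folliculitis"],
    ["urticaria", "drug eruption", "contact dermatitis", "scabies"],
]
truths = ["Eczema", "basal cell carcinoma", "acne", "scabies"]

# Hypothetical triage flags: True means "flagged as severe".
model_flags = [True, False, True, True, False]
reference_flags = [True, True, False, True, False]

print(top_k_accuracy(predictions, truths, k=3))
print(sensitivity(model_flags, reference_flags))
```

In the fourth toy case the correct diagnosis is ranked fourth, so it counts as a miss at k=3. Under a looser matching rule (for example, accepting synonyms or parent categories) the same predictions could score higher, which is one reason the matching criterion matters when comparing reported accuracies.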

Related papers