Fugu-MT paper translation (abstract): Evaluation of Medical Vision Language Models HuluMed and MedGemma, and general purpose chatbots Gemma 3, ChatGPT Plus, and Claude Pro on real previously unseen wound images

Paper overview: Evaluation of Medical Vision Language Models HuluMed and MedGemma, and general purpose chatbots Gemma 3, ChatGPT Plus, and Claude Pro on real previously unseen wound images

arxiv url: http://arxiv.org/abs/2606.20723v1
Date: Tue, 16 Jun 2026 22:33:14 GMT
Status: Translation complete
Last updated in system: 2026-06-26 13:32:49.568103
Title: Evaluation of Medical Vision Language Models HuluMed and MedGemma, and general purpose chatbots Gemma 3, ChatGPT Plus, and Claude Pro on real previously unseen wound images
Title (reference translation): Evaluation of the medical vision-language models HuluMed and MedGemma and the general-purpose chatbots Gemma 3, ChatGPT Plus, and Claude Pro
Authors: Yunzhe Xue, Mohammed Saim Ahmed Quadri, Neal Panse, Justin W. Ady, Usman Roshan
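Abstract summary: This study evaluates the performance of general-purpose and medically specialized, open-source and proprietary Vision-Language Models (VLMs) for clinical wound assessment. ChatGPT achieved the highest performance with 174/240 correct responses (72.50%), followed by Claude with 149/240 (62.08%). The findings suggest that frontier general-purpose multimodal systems currently show substantially stronger wound-analysis performance than medically specialized alternatives.
Reference score (independently computed attention score): 2.6097841018267616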
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chronic wound assessment remains a clinically challenging task that requires accurate interpretation of wound morphology, tissue composition, vascular characteristics, and infection risk. Recent advances in Vision-Language Models (VLMs) have introduced the possibility of automated multimodal wound analysis through image understanding combined with clinical reasoning. This study evaluates the performance of several general-purpose and medically specialized open-source and proprietary VLMs for clinical wound assessment using an expanded, curated dataset of 20 clinically diverse wounds spanning vascular, surgical, ischemic, venous, lymphedema, and amputation-related etiologies. Six VLMs were evaluated using a structured twelve-question clinical framework covering wound classification, infection risk, vascular intervention recommendations, debridement urgency, wound therapy selection, and advanced management planning. Across 20 wound cases and 240 clinician-graded wound-analysis decisions, ChatGPT achieved the highest overall performance with 174/240 correct responses (72.50%), followed by Claude with 149/240 (62.08%). Among the open-source and medically specialized models, HuluMed achieved the strongest performance with 96/240 correct responses (40.00%), followed by Gemma 3 (81/240, 33.75%), MedGemma 4B (62/240, 25.83%), and MedGemma 27B (42/240, 17.50%). The findings suggest that frontier general-purpose multimodal systems currently demonstrate substantially stronger wound-analysis performance than medically specialized alternatives, highlighting the continued importance of broad multimodal reasoning capabilities alongside domain-specific medical knowledge. Although current VLMs demonstrate promising potential for clinical decision support, substantial limitations remain in advanced wound-management reasoning, procedural planning, and autonomous clinical reliability.
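Abstract (reference translation): Chronic wound assessment is a clinically challenging task that requires accurate interpretation of wound morphology, tissue composition, vascular characteristics, and infection risk. Recent advances in Vision-Language Models (VLMs) have opened the possibility of automated multimodal wound analysis that combines image understanding with clinical reasoning. This study evaluates general-purpose and medically specialized VLMs for clinical wound assessment on 20 clinically diverse wounds spanning vascular, surgical, ischemic, venous, lymphedema, and amputation-related etiologies. The VLMs were evaluated using a structured twelve-question clinical framework covering wound classification, infection risk, vascular intervention recommendations, debridement urgency, wound therapy selection, and advanced management planning. Across 20 wound cases and 240 clinician-graded wound-analysis decisions, ChatGPT recorded the best result with 174/240 correct responses (72.50%), followed by Claude with 149/240 (62.08%). Among the open-source and medically specialized models, HuluMed performed best with 96/240 correct responses (40.00%), followed by Gemma 3 (81/240, 33.75%), MedGemma 4B (62/240, 25.83%), and MedGemma 27B (42/240, 17.50%). The results suggest that frontier general-purpose multimodal systems currently show far stronger wound-analysis performance than medically specialized alternatives, underscoring the continued importance of broad multimodal reasoning capabilities alongside domain-specific medical knowledge. Current VLMs show promising potential for clinical decision support, but substantial limitations remain in advanced wound-management reasoning, procedural planning, and autonomous clinical reliability.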
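
The accuracy figures quoted in the abstract follow directly from the reported counts: 20 wound cases times 12 questions gives 240 graded decisions per model, and each percentage is the correct count divided by 240. The Python sketch below reproduces that arithmetic as a sanity check. The counts are taken from the abstract; the names `CASES`, `QUESTIONS_PER_CASE`, `accuracy_percent`, and `accuracy_table` are illustrative assumptions and not part of the paper.

```python
# Sanity check of the accuracy percentages reported in the abstract.
# Counts come from the abstract; all identifiers here are illustrative only.

CASES = 20                # wound cases in the evaluation set
QUESTIONS_PER_CASE = 12   # structured clinical questions per case
TOTAL = CASES * QUESTIONS_PER_CASE  # 240 graded decisions per model

# Correct responses per model, as reported in the abstract.
correct = {
    "ChatGPT": 174,
    "Claude": 149,
    "HuluMed": 96,
    "Gemma 3": 81,
    "MedGemma 4B": 62,
    "MedGemma 27B": 42,
}


def accuracy_percent(n_correct, total=TOTAL):
    """Return accuracy as a percentage of graded decisions."""
    return 100.0 * n_correct / total


def accuracy_table(counts, total=TOTAL):
    """Return (model, correct, percent string) rows, best model first."""
    rows = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, n, f"{accuracy_percent(n, total):.2f}") for name, n in rows]


if __name__ == "__main__":
    for name, n, pct in accuracy_table(correct):
        print(f"{name:<13} {n:>3}/{TOTAL}  {pct}%")
```

Each computed percentage can then be compared against the value stated in the abstract for the same model.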
