Fugu-MT Paper Translation (Summary): Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

Paper summary: Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

arxiv url: http://arxiv.org/abs/2509.18015v1
Date: Mon, 22 Sep 2025 16:54:23 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.519004
Title: Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs
Authors: Advait Gosai, Arun Kavishwar, Stephanie L. McNamara, Soujanya Samineni, Renato Umeton, Alexander Chowdhury, William Lotter
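Abstract summary: We evaluated two general-purpose multimodal large language models (GPT-4 and GPT-5) and a domain-specific model (MedGemma) on their ability to localize pathologies in chest radiographs. GPT-5 reached a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all below a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and produced anatomically implausible predictions more frequently.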
Reference score (independently calculated attention score): 33.80781505782195
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, showing limited capacity to generalize to this novel task. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
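The grid-based prompting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the grid size, the coordinate format the models return, or how a prediction is scored, so everything below is an assumption made for illustration: a square grid of `grid_size` cells per side, a prediction given as a (row, col) cell, and a "hit" defined as the cell center falling inside a binary ground-truth mask. The function names `cell_to_pixel`, `is_hit`, and `localization_accuracy` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # mask[y][x] is 1 inside the finding, 0 outside


def cell_to_pixel(row: int, col: int, width: int, height: int,
                  grid_size: int) -> Tuple[int, int]:
    """Map a grid cell (row, col) to the pixel (x, y) at the cell's center.

    Assumes a grid_size x grid_size grid laid over a width x height image.
    """
    if not (0 <= row < grid_size and 0 <= col < grid_size):
        raise ValueError("cell lies outside the grid")
    cell_w = width / grid_size
    cell_h = height / grid_size
    return int((col + 0.5) * cell_w), int((row + 0.5) * cell_h)


def is_hit(point: Tuple[int, int], mask: Mask) -> bool:
    """Assumed scoring rule: the predicted point lies inside the mask."""
    x, y = point
    return bool(mask[y][x])


def localization_accuracy(predictions: List[Tuple[int, int]],
                          masks: List[Mask],
                          width: int, height: int, grid_size: int) -> float:
    """Fraction of predicted cells whose center lands inside its mask."""
    hits = sum(
        is_hit(cell_to_pixel(row, col, width, height, grid_size), mask)
        for (row, col), mask in zip(predictions, masks)
    )
    return hits / len(predictions)


# Toy 8x8 "image" with a finding covering rows 2-3, columns 4-5.
toy_mask = [[1 if 2 <= y <= 3 and 4 <= x <= 5 else 0 for x in range(8)]
            for y in range(8)]

# One prediction inside the finding, one outside, on an assumed 4x4 grid.
toy_accuracy = localization_accuracy([(1, 2), (0, 0)], [toy_mask, toy_mask],
                                     width=8, height=8, grid_size=4)
```

A per-pathology score under this assumed rule would be this fraction computed over the cases for that pathology; averaging those fractions across pathologies gives a single summary number comparable in form, though not necessarily in definition, to the figures quoted in the abstract.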
