Fugu-MT Paper Translation (Summary): Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

Paper summary: Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

arxiv url: http://arxiv.org/abs/2604.16980v1
Date: Sat, 18 Apr 2026 12:42:57 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.271771
Title: Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
Title (reference translation): Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
Authors: Bruce A. Bassett, Amy Rouillard, Sitwala Mundia, Michael Cameron Gramanie, Linda Camara, Ziyaad Dangor, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Ismail Kalla, Haroon Saloojee
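Abstract summary: Large language models (LLMs) are increasingly proposed for diagnostic support, but few evaluations use real-world multimodal inpatient data, particularly in public hospitals in low- and middle-income countries (LMICs). The authors conducted VALID, a retrospective evaluation of 539 multimodal inpatient cases from a tertiary public hospital in South Africa.
Reference score (independently computed attention score): 0.7857924499207116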
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Background: Large language models (LLMs) are increasingly proposed for diagnostic support, but few evaluations use real-world multimodal inpatient data, particularly in low- and middle-income country (LMIC) public hospitals. Methods: We conducted VALID, a retrospective evaluation of 539 multimodal inpatient cases from a tertiary public hospital in South Africa. Inputs included radiology imaging (CT, MRI, CXR) and reports, laboratory results, clinical notes, and vital signs. Expert panels adjudicated 300 cases (balanced and discordant subsets) to establish ground truth diagnoses, differentials, and reasoning. Ten multimodal LLMs generated zero-shot outputs. A calibrated three-model LLM Jury scored all outputs and routine ward diagnoses across diagnostic accuracy, differential quality, reasoning, and patient safety (>10,000 evaluations). Primary outcomes were composite scores ($S_3$, $S_4$) and win rates. Results: (i) LLM performance was tightly clustered (<15% variation) despite large cost differences; low-cost models performed comparably to top models. (ii) All LLMs significantly outperformed routine ward diagnoses on average diagnostic and safety scores. (iii) Top performance was achieved by GPT-5.1, followed by Gemini models. (iv) Adding radiology reports improved performance by 6%. (v) Diagnostic and reasoning scores were highly correlated ($\rho = 0.85$). (vi) Output rates varied (65-100%) due to input constraints. Results were robust across subsets and evaluation design. Conclusions: Across a real-world LMIC dataset, multimodal LLMs showed similar diagnostic performance despite large cost differences and outperformed routine care on average safety metrics. Affordability, robustness, and deployment constraints may outweigh marginal performance differences in LMIC settings.
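Abstract (reference translation): Background: Large language models (LLMs) are increasingly proposed for diagnostic support, but evaluations using real-world multimodal inpatient data are scarce, particularly in public hospitals in low- and middle-income countries (LMICs). Methods: VALID, a retrospective evaluation of 539 multimodal inpatient cases, was conducted at a tertiary public hospital in South Africa. Inputs included radiology imaging (CT, MRI, CXR) and reports, laboratory results, clinical notes, and vital signs. Expert panels adjudicated 300 cases (balanced and discordant subsets) to establish ground-truth diagnoses, differentials, and reasoning. Ten multimodal LLMs generated zero-shot outputs. A calibrated three-model LLM Jury scored all outputs and routine ward diagnoses on diagnostic accuracy, differential quality, reasoning, and patient safety (>10,000 evaluations). The primary outcomes were composite scores ($S_3$, $S_4$) and win rates. Results: (i) LLM performance was tightly clustered (<15% variation) despite large cost differences, and low-cost models performed comparably to top models. (ii) All LLMs significantly outperformed routine ward diagnoses on average diagnostic and safety scores. (iii) GPT-5.1 achieved the top performance, followed by the Gemini models. (iv) Adding radiology reports improved performance by 6%. (v) Diagnostic and reasoning scores were highly correlated ($\rho = 0.85$). (vi) Output rates varied (65-100%) due to input constraints. Results were robust across subsets and evaluation design. Conclusions: Across a real-world LMIC dataset, multimodal LLMs showed similar diagnostic performance despite large cost differences and outperformed routine care on average safety metrics. Affordability, robustness, and deployment constraints may outweigh marginal performance differences in LMIC settings.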
