Fugu-MT Paper Translation (Abstract): Capabilities of GPT-5 on Multimodal Medical Reasoning

Paper abstract: Capabilities of GPT-5 on Multimodal Medical Reasoning

arxiv url: http://arxiv.org/abs/2508.08224v2
Date: Wed, 13 Aug 2025 05:32:22 GMT
Status: Translation complete
Last updated in system: 2025-08-14 11:55:47.619825
Title: Capabilities of GPT-5 on Multimodal Medical Reasoning
Title (reference translation): Capabilities of GPT-5 in Multimodal Medical Reasoning
Authors: Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang
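Abstract summary: This study positions GPT-5 as a generalist multimodal reasoner for medical decision support. GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 are benchmarked on standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD.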
Reference score (independently computed attention score): 4.403894457826502
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.
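Abstract (reference translation): Recent advances in large language models (LLMs) have enabled general-purpose systems to carry out increasingly complex domain-specific reasoning without extensive fine-tuning. In medicine, decision-making often requires integrating heterogeneous information sources such as patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 are benchmarked on standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. The results show that GPT-5 consistently outperforms all baselines, achieves state-of-the-art accuracy on all QA benchmarks, and delivers substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores over GPT-4o by +29.26% and +26.18%, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. By contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study shows GPT-5 integrating visual and textual cues into a coherent diagnostic reasoning chain and recommending appropriate high-stakes interventions. On these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.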

Related papers