Capabilities of GPT-5 on Multimodal Medical Reasoning
- URL: http://arxiv.org/abs/2508.08224v2
- Date: Wed, 13 Aug 2025 05:32:22 GMT
- Title: Capabilities of GPT-5 on Multimodal Medical Reasoning
- Authors: Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang
- Abstract summary: This study positions GPT-5 as a generalist multimodal reasoner for medical decision support. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD.
- Score: 4.403894457826502
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.
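The abstract describes a unified protocol: zero-shot chain-of-thought prompting over multiple-choice QA benchmarks, scored by accuracy. As a rough illustration only (not the authors' code), a minimal scoring loop might look like the sketch below; `ask_model` is a hypothetical stand-in for whatever model API is being evaluated:

```python
def evaluate_accuracy(items, ask_model):
    """Score a model on multiple-choice QA items.

    items: list of dicts with 'question', 'options' (letter -> text),
           and 'answer' (the correct letter).
    ask_model: callable taking a prompt string and returning the
               model's reply, expected to begin with a letter choice.
    """
    correct = 0
    for item in items:
        # Zero-shot chain-of-thought style prompt: question, options,
        # and an instruction to reason before committing to a letter.
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
            + "\nLet's think step by step, then answer with a single letter."
        )
        reply = ask_model(prompt)
        if reply.strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)

# Tiny smoke test with a stub "model" that always answers A:
items = [
    {"question": "Q1", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"question": "Q2", "options": {"A": "x", "B": "y"}, "answer": "B"},
]
print(evaluate_accuracy(items, lambda prompt: "A"))  # 0.5
```

In practice the answer would be parsed more robustly from the model's free-text reasoning, but the accuracy-over-standardized-splits structure is the same as what the paper reports.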
Related papers
- Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary [36.736436091313585]
This commentary is the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence.
arXiv Detail & Related papers (2026-03-05T03:24:48Z)
- OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum [51.97232679580821]
Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation. Most patients worldwide lack access to timely expert consensus. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework.
arXiv Detail & Related papers (2026-02-14T14:13:10Z)
- OpenAI GPT-5 System Card [247.27796140570612]
GPT-5 is a unified system with a smart and fast model that answers most questions. A real-time router decides which model to use based on conversation type, complexity, tool needs, and explicit intent. Once usage limits are reached, a mini version of each model handles remaining queries.
arXiv Detail & Related papers (2025-12-19T07:05:38Z)
- Benchmarking GPT-5 for biomedical natural language processing [17.663813433200122]
This study extends a unified benchmark to evaluate GPT-5 and GPT-4o across five core biomedical NLP tasks. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets.
arXiv Detail & Related papers (2025-08-28T13:06:53Z)
- Capabilities of GPT-5 across critical domains: Is it the next breakthrough? [0.0]
GPT-4 by OpenAI introduced advances in reasoning, multimodality, and task generalization. Released in August 2025, GPT-5 incorporates a system-of-models architecture designed for task-specific optimization. This study provides one of the first systematic comparisons of GPT-4 and GPT-5 using human raters from linguistics and clinical fields.
arXiv Detail & Related papers (2025-08-16T12:26:11Z)
- Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology [4.156123728258067]
We present a zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o of up to +200% in challenging anatomical regions. GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving.
arXiv Detail & Related papers (2025-08-15T16:14:51Z)
- Performance of GPT-5 in Brain Tumor MRI Reasoning [4.156123728258067]
Large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. We evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%).
arXiv Detail & Related papers (2025-08-14T17:35:31Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is, to date, the most comprehensive general medical AI benchmark, with a well-categorized data structure and multiple levels of perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Capabilities of Gemini Models in Medicine [100.60391771032887]
We introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine.
We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them.
Our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment.
arXiv Detail & Related papers (2024-04-29T04:11:28Z)
- Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine [15.491432387608112]
Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks.
Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning.
arXiv Detail & Related papers (2024-01-16T14:41:20Z)
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
GPT-4V excels at understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs substantial improvement.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V [20.84152508192388]
We critically evaluate the capabilities of the state-of-the-art multimodal large language model, GPT-4 with Vision (GPT-4V).
Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets.
Based on accuracy scores, the experiments conclude that the current version of GPT-4V is not recommended for real-world diagnostics.
arXiv Detail & Related papers (2023-10-29T16:26:28Z)
- Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training, nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.