Capabilities of GPT-5 on Multimodal Medical Reasoning
- URL: http://arxiv.org/abs/2508.08224v2
- Date: Wed, 13 Aug 2025 05:32:22 GMT
- Title: Capabilities of GPT-5 on Multimodal Medical Reasoning
- Authors: Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang
- Abstract summary: This study positions GPT-5 as a generalist multimodal reasoner for medical decision support. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD.
- Score: 4.403894457826502
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.
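The abstract describes a unified protocol: zero-shot chain-of-thought prompting over multiple-choice QA benchmarks, scored by accuracy. As a rough illustration only (not the authors' code), a minimal scoring loop might look like the sketch below; `ask_model` is a hypothetical stand-in for whatever model API is being evaluated:

```python
def evaluate_accuracy(items, ask_model):
    """Score a model on multiple-choice QA items.

    items: list of dicts with 'question', 'options' (letter -> text),
           and 'answer' (the correct letter).
    ask_model: callable taking a prompt string and returning the
               model's reply, expected to begin with a letter choice.
    """
    correct = 0
    for item in items:
        # Zero-shot chain-of-thought style prompt: question, options,
        # and an instruction to reason before committing to a letter.
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
            + "\nLet's think step by step, then answer with a single letter."
        )
        reply = ask_model(prompt)
        if reply.strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)

# Tiny smoke test with a stub "model" that always answers A:
items = [
    {"question": "Q1", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"question": "Q2", "options": {"A": "x", "B": "y"}, "answer": "B"},
]
print(evaluate_accuracy(items, lambda prompt: "A"))  # 0.5
```

In practice the answer would be parsed more robustly from the model's free-text reasoning, but the accuracy-over-standardized-splits structure is the same as what the paper reports.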
Related papers
- Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary [36.736436091313585]
This commentary is the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence.
arXiv Detail & Related papers (2026-03-05T03:24:48Z)
- OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum [51.97232679580821]
Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation. Most patients worldwide lack access to timely expert consensus. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework.
arXiv Detail & Related papers (2026-02-14T14:13:10Z)
- OpenAI GPT-5 System Card [247.27796140570612]
GPT-5 is a unified system with a smart and fast model that answers most questions. A real-time router decides which model to use based on conversation type, complexity, tool needs, and explicit intent. Once usage limits are reached, a mini version of each model handles remaining queries.
arXiv Detail & Related papers (2025-12-19T07:05:38Z)
- Benchmarking GPT-5 for biomedical natural language processing [17.663813433200122]
This study extends a unified benchmark to evaluate GPT-5 and GPT-4o across five core biomedical NLP tasks. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets.
arXiv Detail & Related papers (2025-08-28T13:06:53Z)
- Capabilities of GPT-5 across critical domains: Is it the next breakthrough? [0.0]
GPT-4 by OpenAI introduced advances in reasoning, multimodality, and task generalization. Released in August 2025, GPT-5 incorporates a system-of-models architecture designed for task-specific optimization. This study provides one of the first systematic comparisons of GPT-4 and GPT-5 using human raters from linguistics and clinical fields.
arXiv Detail & Related papers (2025-08-16T12:26:11Z)
- Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology [4.156123728258067]
We present a zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o of up to +200% in challenging anatomical regions. GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving.
arXiv Detail & Related papers (2025-08-15T16:14:51Z)
- Performance of GPT-5 in Brain Tumor MRI Reasoning [4.156123728258067]
Large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. We evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%).
arXiv Detail & Related papers (2025-08-14T17:35:31Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is, to date, the most comprehensive general medical AI benchmark, with a well-categorized data structure and multiple levels of perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Capabilities of Gemini Models in Medicine [100.60391771032887]
We introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine.
We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them.
Our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment.
arXiv Detail & Related papers (2024-04-29T04:11:28Z)
- Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine [15.491432387608112]
Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks.
Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning.
arXiv Detail & Related papers (2024-01-16T14:41:20Z)
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
GPT-4V excels at understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs substantial improvement.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V [20.84152508192388]
We critically evaluate the capabilities of the state-of-the-art multimodal large language model, GPT-4 with Vision (GPT-4V).
Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets.
Based on accuracy scores, the experiments conclude that the current version of GPT-4V is not recommended for real-world diagnostics.
arXiv Detail & Related papers (2023-10-29T16:26:28Z)
- Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training, nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.