Related papers: Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

URL: http://arxiv.org/abs/2506.02987v1
Date: Tue, 03 Jun 2025 15:25:38 GMT
Title: Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis
Authors: Richard Armitage,
Abstract summary: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen questions from the Royal College of General Practitioners GP SelfTest.<n>The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Member of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with full question information. Each question was attempted once by each model. Responses were scored against correct answers provided by GP SelfTest. Results: The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3. These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.

Related papers

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.<n>We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions.<n>We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
Performance of Large Language Models in Answering Critical Care Medicine Questions [1.825224193230824]
Large Language Models tested on 871 Critical Care Medicine questions.<n>Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy.
arXiv Detail & Related papers (2025-09-16T14:46:34Z)
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges.<n>Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases.<n>A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z)
MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian.<n>We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z)
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
Large language models (LLMs) achieve near-perfect scores on medical exams.<n>These evaluations inadequately reflect complexity and diversity of real-world clinical practice.<n>We introduce MedHELM, an evaluation framework for assessing LLM performance for medical tasks.
arXiv Detail & Related papers (2025-05-26T22:55:49Z)
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education [0.7771252627207672]
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities.<n>We created a novel benchmark of free-response questions with paired MCQs (FreeMedQA)<n>Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions.
arXiv Detail & Related papers (2025-03-13T19:42:04Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
oRetrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness [4.118721833273984]
Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions.
arXiv Detail & Related papers (2024-10-11T00:34:20Z)
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
Large Language Models (LLMs) demonstrate significant potential in the medical domain.<n>They are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE.<n>We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
GPT-4 passes most of the 297 written Polish Board Certification Examinations [0.5461938536945723]
This study evaluated the performance of three Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam (Pa'nstwowy Egzamin Specjalizacyjny, PES) dataset. The GPT models varied significantly, displaying excellence in exams related to certain specialties while completely failing others.
arXiv Detail & Related papers (2024-04-29T09:08:22Z)
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text [82.7001841679981]
BioMedLM is a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with larger models. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics.
arXiv Detail & Related papers (2024-03-27T10:18:21Z)
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z)
A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology [0.6213359027997152]
The study was conducted to evaluate the ability of LLM models to provide correct answers to nephSAP multiple-choice questions. The findings of this study potentially have significant implications for the future medical training and patient care.
arXiv Detail & Related papers (2023-08-09T05:01:28Z)
PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding. LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors [9.374652839580182]
This study assessed the capabilities of GPT-4 in traditional Korean medicine (TKM) We optimized prompts with Chinese-term annotation, English translation for questions and instruction, exam-optimized instruction, and self-consistency. GPT-4 with optimized prompts achieved 66.18% accuracy, surpassing the examination's average pass mark of 60% and the 40% minimum for each subject.
arXiv Detail & Related papers (2023-03-31T05:43:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.