Advancing Medical Artificial Intelligence Using a Century of Cases
- URL: http://arxiv.org/abs/2509.12194v1
- Date: Mon, 15 Sep 2025 17:54:51 GMT
- Title: Advancing Medical Artificial Intelligence Using a Century of Cases
- Authors: Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, Andrew S. Lea, Marinka Zitnik, Scott H. Podolsky, Zahir Kanjee, Raja-Elie E. Abdulnour, Jacob M. Koshy, Adam Rodman, Arjun K. Manrai
- Abstract summary: Previous AI evaluations focused on final diagnoses without addressing multifaceted reasoning and presentation skills. We created CPC-Bench, a benchmark spanning 10 text-based and multimodal tasks. We developed "Dr. CaBot," an AI discussant designed to produce written and slide-based video presentations.
- Score: 8.82283040766685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BACKGROUND: For over a century, the New England Journal of Medicine Clinicopathological Conferences (CPCs) have tested the reasoning of expert physicians and, recently, artificial intelligence (AI). However, prior AI evaluations have focused on final diagnoses without addressing the multifaceted reasoning and presentation skills required of expert discussants. METHODS: Using 7102 CPCs (1923-2025) and 1021 Image Challenges (2006-2025), we conducted extensive physician annotation and automated processing to create CPC-Bench, a physician-validated benchmark spanning 10 text-based and multimodal tasks, against which we evaluated leading large language models (LLMs). Then, we developed "Dr. CaBot," an AI discussant designed to produce written and slide-based video presentations using only the case presentation, modeling the role of the human expert in these cases. RESULTS: When challenged with 377 contemporary CPCs, o3 (OpenAI) ranked the final diagnosis first in 60% of cases and within the top ten in 84% of cases, outperforming a 20-physician baseline; next-test selection accuracy reached 98%. Event-level physician annotations quantified AI diagnostic accuracy per unit of information. Performance was lower on literature search and image tasks; o3 and Gemini 2.5 Pro (Google) achieved 67% accuracy on image challenges. In blinded comparisons of CaBot vs. human expert-generated text, physicians misclassified the source of the differential in 46 of 62 (74%) of trials, and scored CaBot more favorably across quality dimensions. To promote research, we are releasing CaBot and CPC-Bench. CONCLUSIONS: LLMs exceed physician performance on complex text-based differential diagnosis and convincingly emulate expert medical presentations, but image interpretation and literature retrieval remain weaker. CPC-Bench and CaBot may enable transparent and continued tracking of progress in medical AI.
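The "ranked first" and "within the top ten" figures in the abstract are rank-based accuracies. A minimal sketch of how such a metric could be computed follows, assuming each case is reduced to the 1-indexed rank of the correct diagnosis in the model's differential (or `None` when it is absent). The function name `top_k_accuracy` and the toy ranks are hypothetical illustrations, not CPC-Bench code or data.

```python
from typing import Optional, Sequence


def top_k_accuracy(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose correct diagnosis is ranked at position <= k.

    ranks: 1-indexed rank of the correct diagnosis per case, or None if the
           correct diagnosis does not appear in the differential at all.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical toy data: ten cases, not taken from the paper.
toy_ranks = [1, 3, None, 1, 12, 2, 1, 10, None, 5]
top1 = top_k_accuracy(toy_ranks, k=1)    # cases ranked first
top10 = top_k_accuracy(toy_ranks, k=10)  # cases ranked within the top ten
print(f"top-1: {top1:.0%}, top-10: {top10:.0%}")
```

Treating a missing diagnosis as a miss at every `k` keeps the denominator equal to the total number of cases, so top-1 and top-10 values stay directly comparable.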
Related papers
- AI-assisted workflow enables rapid, high-fidelity breast cancer clinical trial eligibility prescreening [4.008304844602351]
We developed MSK-MATCH (Memorial Sloan Kettering Multi-Agent Trial Coordination Hub), an AI system for automated eligibility screening from clinical text. MSK-MATCH integrates a large language model with a curated oncology trial knowledge base and retrieval-augmented architecture. In a retrospective dataset of 88,518 clinical documents from 731 patients across six breast cancer trials, MSK-MATCH automatically resolved 61.9% of cases and triaged 38.1% for human review.
arXiv Detail & Related papers (2025-11-07T20:27:05Z)
- DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services [49.70819009392778]
Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, multi-agent system for simulating realistic scenarios.
arXiv Detail & Related papers (2025-10-24T08:01:21Z)
- Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting [0.0]
Globally we face a projected shortage of 11 million healthcare practitioners by 2030. No end-to-end autonomous large language model (LLM)-based AI system has been rigorously evaluated in real-world clinical practice.
arXiv Detail & Related papers (2025-06-27T19:04:44Z)
- Sequential Diagnosis with Language Models [21.22416732642907]
We introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging cases into stepwise diagnostic encounters. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians.
arXiv Detail & Related papers (2025-06-27T17:27:26Z)
- An Agentic System for Rare Disease Diagnosis with Traceable Reasoning [69.46279475491164]
We introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM). DeepRare generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases.
arXiv Detail & Related papers (2025-06-25T13:42:26Z)
- A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI [0.0]
This study introduces a scalable benchmarking methodology for assessing health AI systems. Our methodology employs 400 validated clinical vignettes across 14 medical specialties, using AI-powered patient actors to simulate realistic clinical interactions. August achieved a top-one diagnostic accuracy of 81.8% (327/400 cases) and a top-two accuracy of 85.0% (340/400 cases), significantly outperforming traditional symptom checkers.
arXiv Detail & Related papers (2024-12-17T05:02:33Z)
- Towards Conversational Diagnostic AI [32.84876349808714]
We introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions.
AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors.
arXiv Detail & Related papers (2024-01-11T04:25:06Z)
- BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address limitations due to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
- Towards the Use of Saliency Maps for Explaining Low-Quality Electrocardiograms to End Users [51.644376281196394]
When using medical images for diagnosis, it is important that the images are of high quality. In telemedicine, a common problem is that the quality issue is only flagged once the patient has left the clinic, meaning they must return in order to have the exam redone. This paper reports on the development of an AI system for flagging and explaining low-quality medical images in real-time.
arXiv Detail & Related papers (2022-07-06T14:53:26Z)
- Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence [79.038671794961]
We launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be distributedly trained and independently executed at each host institution.
Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK.
arXiv Detail & Related papers (2021-11-18T00:43:41Z)
- Review of Artificial Intelligence Techniques in Imaging Data Acquisition, Segmentation and Diagnosis for COVID-19 [71.41929762209328]
The pandemic of coronavirus disease 2019 (COVID-19) is spreading all over the world.
Medical imaging such as X-ray and computed tomography (CT) plays an essential role in the global fight against COVID-19.
The recently emerging artificial intelligence (AI) technologies further strengthen the power of the imaging tools and help medical specialists.
arXiv Detail & Related papers (2020-04-06T15:21:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.