Related papers: Superhuman performance of a large language model on the reasoning tasks of a physician

Superhuman performance of a large language model on the reasoning tasks of a physician

URL: http://arxiv.org/abs/2412.10849v1
Date: Sat, 14 Dec 2024 14:46:18 GMT
Title: Superhuman performance of a large language model on the reasoning tasks of a physician
Authors: Peter G. Brodeur, Thomas A. Buckley, Zahir Kanjee, Ethan Goh, Evelyn Bin Ling, Priyank Jain, Stephanie Cabral, Raja-Elie Abdulnour, Adrian Haimovich, Jason A. Freed, Andrew Olson, Daniel J. Morgan, Jason Hom, Robert Gallo, Eric Horvitz, Jonathan Chen, Arjun K. Manrai, Adam Rodman,
Abstract summary: Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks.<n>We evaluate OpenAI's o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response.
Score: 10.043418251604624
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks. However, such benchmarks are highly constrained, saturated with repeated impressive performance by LLMs, and have an unclear relationship to performance in real clinical scenarios. Clinical reasoning, the process by which physicians employ critical thinking to gather and synthesize clinical data to diagnose and manage medical problems, remains an attractive benchmark for model performance. Prior LLMs have shown promise in outperforming clinicians in routine and complex diagnostic scenarios. We sought to evaluate OpenAI's o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response. We characterize the performance of o1-preview with five experiments including differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, adjudicated by physician experts with validated psychometrics. Our primary outcome was comparison of the o1-preview output to identical prior experiments that have historical human controls and benchmarks of previous LLMs. Significant improvements were observed with differential diagnosis generation and quality of diagnostic and management reasoning. No improvements were observed with probabilistic reasoning or triage differential diagnosis. This study highlights o1-preview's ability to perform strongly on tasks that require complex critical thinking such as diagnosis and management while its performance on probabilistic reasoning tasks was similar to past models. New robust benchmarks and scalable evaluation of LLM capabilities compared to human physicians are needed along with trials evaluating AI in real clinical settings.

Related papers

ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z)
AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology [47.52685298426068]
We systematically evaluate the reasoning capabilities of large language models (LLMs) in anesthesiology. AnesBench is a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels.
arXiv Detail & Related papers (2025-04-03T08:54:23Z)
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models [21.849783391186754]
We provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding.
arXiv Detail & Related papers (2025-04-01T14:57:43Z)
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning [34.93995619867384]
Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning-scenarios.
arXiv Detail & Related papers (2025-03-10T15:38:44Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning [17.647875658030006]
This study investigates the underlying mechanisms behind the performance degradation phenomenon. We developed a plug-and-play method enhanced (PPME) LLM agent, leveraging over 3.5 million electronic medical records from Chinese and American healthcare facilities. Our approach integrates specialized models for initial disease diagnosis and inquiry into the history of the present illness, trained through supervised and reinforcement learning techniques.
arXiv Detail & Related papers (2025-02-24T06:24:20Z)
SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge. We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA [24.10436440624249]
Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA) This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions.
arXiv Detail & Related papers (2024-05-30T18:56:01Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis [30.943705201552643]
We propose a framework to model the diagnosis process in the real world by adaptively fusing probability distributions of agents over potential diseases. Our approach requires significantly less parameter updating and training time, enhancing efficiency and practical utility.
arXiv Detail & Related papers (2024-01-29T12:25:30Z)
Towards the Identifiability and Explainability for Personalized Learner Modeling: An Inductive Paradigm [36.60917255464867]
We propose an identifiable cognitive diagnosis framework (ID-CDF) based on a novel response-proficiency-response paradigm inspired by encoder-decoder models. We show that ID-CDF can effectively address the problems without loss of diagnosis preciseness.
arXiv Detail & Related papers (2023-09-01T07:18:02Z)
TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment. In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials. We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
DR.BENCH: Diagnostic Reasoning Benchmark for Clinical Natural Language Processing [5.022185333260402]
Diagnostic Reasoning Benchmarks, DR.BENCH, is a new benchmark for developing and evaluating cNLP models with clinical diagnostic reasoning ability. DR.BENCH is the first clinical suite of tasks designed to be a natural language generation framework to evaluate pre-trained language models.
arXiv Detail & Related papers (2022-09-29T16:05:53Z)
Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks. Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets. We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.