Towards Accurate Differential Diagnosis with Large Language Models
- URL: http://arxiv.org/abs/2312.00164v1
- Date: Thu, 30 Nov 2023 19:55:51 GMT
- Title: Towards Accurate Differential Diagnosis with Large Language Models
- Authors: Daniel McDuff and Mike Schaekermann and Tao Tu and Anil Palepu and Amy
Wang and Jake Garrison and Karan Singhal and Yash Sharma and Shekoofeh Azizi
and Kavita Kulkarni and Le Hou and Yong Cheng and Yun Liu and S Sara Mahdavi
and Sushant Prakash and Anupam Pathak and Christopher Semturs and Shwetak
Patel and Dale R Webster and Ewa Dominowska and Juraj Gottweis and Joelle
Barral and Katherine Chou and Greg S Corrado and Yossi Matias and Jake
Sunshine and Alan Karthikesalingam and Vivek Natarajan
- Abstract summary: Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of differential diagnosis.
20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine.
Our study suggests that our LLM has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases.
- Score: 37.48155380562073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An accurate differential diagnosis (DDx) is a cornerstone of medical care,
often reached through an iterative process of interpretation that combines
clinical history, physical examination, investigations and procedures.
Interactive interfaces powered by Large Language Models (LLMs) present new
opportunities to both assist and automate aspects of this process. In this
study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its
ability to generate a DDx alone or as an aid to clinicians. 20 clinicians
evaluated 302 challenging, real-world medical cases sourced from the New
England Journal of Medicine (NEJM) case reports. Each case report was read by
two clinicians, who were randomized to one of two assistive conditions: either
assistance from search engines and standard medical resources, or LLM
assistance in addition to these tools. All clinicians provided a baseline,
unassisted DDx prior to using the respective assistive tools. Our LLM for DDx
exhibited standalone performance that exceeded that of unassisted clinicians
(top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study
arms, the DDx quality score was higher for clinicians assisted by our LLM
(top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%)
(McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p =
0.03). Further, clinicians assisted by our LLM arrived at more comprehensive
differential lists than those without its assistance. Our study suggests that
our LLM for DDx has potential to improve clinicians' diagnostic reasoning and
accuracy in challenging cases, meriting further real-world evaluation for its
ability to empower physicians and widen patients' access to specialist-level
expertise.
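The abstract's comparisons rest on two computations: top-10 accuracy over each ranked differential list, and McNemar's test on paired per-case correctness between the two assisted arms. The following Python sketch illustrates both; it is not the authors' analysis code, and the per-case outcome data are hypothetical placeholders.

```python
# A minimal sketch (not the authors' analysis code) of the abstract's two
# headline computations: top-10 accuracy over ranked differential lists, and
# McNemar's test on paired per-case outcomes. All data here are hypothetical.
from statsmodels.stats.contingency_tables import mcnemar

def top_k_accuracy(ddx_lists, truths, k=10):
    """Fraction of cases whose reference diagnosis appears in the top-k DDx."""
    hits = [truth in ddx[:k] for ddx, truth in zip(ddx_lists, truths)]
    return sum(hits) / len(hits)

# Hypothetical paired outcomes per case (1 = correct diagnosis in top-10),
# one entry per case for each assisted arm.
llm_arm    = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
search_arm = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# Build the 2x2 concordance table McNemar's test expects:
# rows = LLM arm correct/incorrect, columns = search arm correct/incorrect.
both      = sum(1 for a, b in zip(llm_arm, search_arm) if a and b)
only_llm  = sum(1 for a, b in zip(llm_arm, search_arm) if a and not b)
only_srch = sum(1 for a, b in zip(llm_arm, search_arm) if b and not a)
neither   = sum(1 for a, b in zip(llm_arm, search_arm) if not a and not b)
table = [[both, only_llm], [only_srch, neither]]

result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi2 = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

Note that McNemar's test uses only the discordant cells (cases one arm got right and the other missed), which is why a paired design like the study's two-reader-per-case setup is analyzed this way rather than with an unpaired proportion test.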
Related papers
- Language Models And A Second Opinion Use Case: The Pocket Professional [0.0]
This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making.
The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses.
arXiv Detail & Related papers (2024-10-27T23:48:47Z)
- Towards Accountable AI-Assisted Eye Disease Diagnosis: Workflow Design, External Validation, and Continual Learning [5.940140611616894]
AI shows promise in diagnostic accuracy but faces real-world application issues due to insufficient validation across clinical settings and diverse populations.
This study addresses gaps in the downstream accountability of medical AI through a case study on age-related macular degeneration (AMD) diagnosis and severity classification.
arXiv Detail & Related papers (2024-09-23T15:01:09Z)
- MAGDA: Multi-agent guideline-driven diagnostic assistance [43.15066219293877]
In emergency departments, rural hospitals, or clinics in less developed regions, clinicians often lack rapid access to image analysis by trained radiologists.
In this work, we introduce a new approach for zero-shot guideline-driven decision support.
We model a system of multiple LLM agents augmented with a contrastive vision-language model that collaborate to reach a patient diagnosis.
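As a rough illustration of the multi-agent pattern this summary describes, the sketch below wires stubbed components together: `vlm_findings` stands in for a contrastive vision-language model scoring guideline findings against an image, and `agent_opinion` stands in for an LLM agent. Every name, threshold, and role here is a hypothetical assumption, not MAGDA's implementation.

```python
# A schematic, hypothetical sketch of a guideline-driven multi-agent loop.
# vlm_findings() and agent_opinion() are stubs standing in for a contrastive
# vision-language model and LLM agents; nothing here is MAGDA's actual code.
from dataclasses import dataclass

@dataclass
class Finding:
    name: str
    score: float  # similarity score from the (stubbed) vision-language model

def vlm_findings(image) -> list[Finding]:
    # Stub: a real system would score guideline-derived findings against the image.
    return [Finding("consolidation", 0.71), Finding("pleural effusion", 0.42)]

def agent_opinion(role: str, findings: list[Finding], threshold: float = 0.5) -> str:
    # Stub: a real system would prompt an LLM with a clinical guideline + findings.
    supported = [f.name for f in findings if f.score >= threshold]
    return f"{role} supports: {', '.join(supported) or 'no finding'}"

def diagnose(image) -> list[str]:
    findings = vlm_findings(image)
    roles = ["radiology agent", "guideline agent", "critic agent"]
    return [agent_opinion(role, findings) for role in roles]

print(diagnose(image=None))
```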
arXiv Detail & Related papers (2024-09-10T09:10:30Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The accompanying instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- Automating PTSD Diagnostics in Clinical Interviews: Leveraging Large Language Models for Trauma Assessments [7.219693607724636]
We aim to tackle the shortage of clinicians trained in PTSD assessment by integrating a customized large language model (LLM) into the diagnostic workflow.
We collect 411 clinician-administered diagnostic interviews and devise a novel approach to obtain high-quality data.
We build a comprehensive framework to automate PTSD diagnostic assessments based on interview contents.
arXiv Detail & Related papers (2024-05-18T05:04:18Z)
- Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- Towards Conversational Diagnostic AI [32.84876349808714]
We introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions.
AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors.
arXiv Detail & Related papers (2024-01-11T04:25:06Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)