Towards Conversational Diagnostic AI
- URL: http://arxiv.org/abs/2401.05654v1
- Date: Thu, 11 Jan 2024 04:25:06 GMT
- Title: Towards Conversational Diagnostic AI
- Authors: Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg,
Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh
Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S
Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine
Chou, Greg S Corrado, Yossi Matias, Alan Karthikesalingam and Vivek Natarajan
- Abstract summary: We introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions.
AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors.
- Score: 32.84876349808714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: At the heart of medicine lies the physician-patient dialogue, where skillful
history-taking paves the way for accurate diagnosis, effective management, and
enduring trust. Artificial Intelligence (AI) systems capable of diagnostic
dialogue could increase accessibility, consistency, and quality of care.
However, approximating clinicians' expertise is an outstanding grand challenge.
Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large
Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a novel self-play based simulated environment with automated
feedback mechanisms for scaling learning across diverse disease conditions,
specialties, and contexts. We designed a framework for evaluating
clinically-meaningful axes of performance including history-taking, diagnostic
accuracy, management reasoning, communication skills, and empathy. We compared
AMIE's performance to that of primary care physicians (PCPs) in a randomized,
double-blind crossover study of text-based consultations with validated patient
actors in the style of an Objective Structured Clinical Examination (OSCE). The
study included 149 case scenarios from clinical providers in Canada, the UK,
and India, 20 PCPs for comparison with AMIE, and evaluations by specialist
physicians and patient actors. AMIE demonstrated greater diagnostic accuracy
and superior performance on 28 of 32 axes according to specialist physicians
and 24 of 26 axes according to patient actors. Our research has several
limitations and should be interpreted with appropriate caution. Clinicians were
limited to unfamiliar synchronous text-chat which permits large-scale
LLM-patient interactions but is not representative of usual clinical practice.
While further research is required before AMIE could be translated to
real-world settings, the results represent a milestone towards conversational
diagnostic AI.
Related papers
- Exploring Large Language Models for Specialist-level Oncology Care [17.34069859182619]
We probe the performance of AMIE, a research conversational diagnostic AI system, in the subspecialist domain of breast oncology care.
We curated a set of 50 synthetic breast cancer vignettes representing a range of treatment-naive and treatment-refractory cases.
We developed a detailed clinical rubric for evaluating management plans, including axes such as the quality of case summarization, safety of the proposed care plan, and recommendations for chemotherapy, radiotherapy, surgery and hormonal therapy.
arXiv Detail & Related papers (2024-11-05T18:30:13Z) - A Two-Stage Proactive Dialogue Generator for Efficient Clinical Information Collection Using Large Language Model [0.6926413609535759]
We propose a diagnostic dialogue system to automate the patient information collection procedure.
By exploiting medical history and conversation logic, our conversation agents can pose multi-round clinical queries.
Our experimental results on a real-world medical conversation dataset show that our model can generate clinical queries that mimic the conversation style of real doctors.
arXiv Detail & Related papers (2024-10-02T19:32:11Z) - RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules.
We develop a medical dialogue dataset comprising rule-based communications between patients and physicians.
Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Conversational Disease Diagnosis via External Planner-Controlled Large Language Models [18.93345199841588]
This study presents a LLM-based diagnostic system that enhances planning capabilities by emulating doctors.
By utilizing real patient electronic medical record data, we constructed simulated dialogues between virtual patients and doctors.
arXiv Detail & Related papers (2024-04-04T06:16:35Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation
for Automatic Diagnosis [30.943705201552643]
We propose a framework to model the diagnosis process in the real world by adaptively fusing probability distributions of agents over potential diseases.
Our approach requires significantly less parameter updating and training time, enhancing efficiency and practical utility.
arXiv Detail & Related papers (2024-01-29T12:25:30Z) - Informing clinical assessment by contextualizing post-hoc explanations
of risk prediction models in type-2 diabetes [50.8044927215346]
We consider a comorbidity risk prediction scenario and focus on contexts regarding the patients clinical state.
We employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability.
Our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.
arXiv Detail & Related papers (2023-02-11T18:07:11Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z) - MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware
Medical Dialogue Generation [86.38736781043109]
We build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG.
We propose two kinds of medical dialogue tasks based on MedDG dataset. One is the next entity prediction and the other is the doctor response generation.
Experimental results show that the pre-train language models and other baselines struggle on both tasks with poor performance in our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.