A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI
- URL: http://arxiv.org/abs/2412.12538v1
- Date: Tue, 17 Dec 2024 05:02:33 GMT
- Title: A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI
- Authors: Deep Bhatt, Surya Ayyagari, Anuruddh Mishra
- Abstract summary: This study introduces a scalable benchmarking methodology for assessing health AI systems.
Our methodology employs 400 validated clinical vignettes across 14 medical specialties, using AI-powered patient actors to simulate realistic clinical interactions.
August achieved a top-one diagnostic accuracy of 81.8% (327/400 cases) and a top-two accuracy of 85.0% (340/400 cases), significantly outperforming traditional symptom checkers.
- Abstract: Diagnostic errors in healthcare persist as a critical challenge, with increasing numbers of patients turning to online resources for health information. While AI-powered healthcare chatbots show promise, there exists no standardized and scalable framework for evaluating their diagnostic capabilities. This study introduces a scalable benchmarking methodology for assessing health AI systems and demonstrates its application through August, an AI-driven conversational chatbot. Our methodology employs 400 validated clinical vignettes across 14 medical specialties, using AI-powered patient actors to simulate realistic clinical interactions. In systematic testing, August achieved a top-one diagnostic accuracy of 81.8% (327/400 cases) and a top-two accuracy of 85.0% (340/400 cases), significantly outperforming traditional symptom checkers. The system demonstrated 95.8% accuracy in specialist referrals and required 47% fewer questions compared to conventional symptom checkers (mean 16 vs 29 questions), while maintaining empathetic dialogue throughout consultations. These findings demonstrate the potential of AI chatbots to enhance healthcare delivery, though implementation challenges remain regarding real-world validation and integration of objective clinical data. This research provides a reproducible framework for evaluating healthcare AI systems, contributing to the responsible development and deployment of AI in clinical settings.
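As a rough illustration of the protocol described in the abstract, the sketch below runs one simulated consultation per vignette and scores top-one/top-two diagnostic accuracy plus mean question count. The `HealthAI` and `PatientActor` interfaces, the `Vignette` fields, and the `match` adjudication callback are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Vignette:
    """One validated clinical case (hypothetical schema)."""
    case_id: str
    specialty: str
    true_diagnosis: str

def run_consultation(health_ai, patient_actor, vignette, max_turns=40):
    """Let the AI interview the simulated patient, then return its ranked differential."""
    history = []
    question = health_ai.open_consultation()
    for _ in range(max_turns):
        answer = patient_actor.respond(vignette, question)
        history.append((question, answer))
        question = health_ai.next_question(history)
        if question is None:  # the AI decides it has gathered enough information
            break
    return health_ai.ranked_differential(history), len(history)

def benchmark(health_ai, patient_actor, vignettes, match, k=2):
    """Top-1/top-k accuracy and mean question count over all vignettes."""
    top1 = topk = questions = 0
    for v in vignettes:
        differential, n_questions = run_consultation(health_ai, patient_actor, v)
        questions += n_questions
        if differential and match(differential[0], v.true_diagnosis):
            top1 += 1
        if any(match(d, v.true_diagnosis) for d in differential[:k]):
            topk += 1
    n = len(vignettes)
    return {"top1": top1 / n, f"top{k}": topk / n, "mean_questions": questions / n}
```

In the study itself, diagnosis matching and specialist-referral correctness were adjudicated against each vignette's validated answer; `match` merely stands in for that step.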
Related papers
- Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases [0.0]
We propose that large language models (LLMs) offer more immediately practical applications by enhancing clinicians' capabilities.
We present a framework for responsible AI integration that leverages LLMs' ability to communicate effectively with both patients and providers.
This approach prioritizes standardized, high-quality data collection to enable a system that learns from every patient encounter.
arXiv Detail & Related papers (2025-02-06T19:09:11Z)
- Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering [51.26412822853409]
We present a novel personalized federated learning (pFL) method for medical visual question answering (VQA) models.
Our method introduces learnable prompts into a Transformer architecture to efficiently train it on diverse medical datasets without massive computational costs.
arXiv Detail & Related papers (2024-10-23T00:31:17Z)
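A minimal sketch of the prompt-tuning idea that summary describes, written in PyTorch: a small set of learnable prompt vectors is prepended to the input of a frozen Transformer encoder, so only the lightweight prompt parameters need to be trained (and, in a federated setting, personalized per client). The class name, shapes, and batch-first encoder interface are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """A frozen Transformer encoder preceded by learnable prompt tokens (illustrative)."""

    def __init__(self, encoder: nn.Module, d_model: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                     # backbone stays frozen; only prompts train
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token/patch embeddings; prepend the prompts to every sequence
        prompts = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, x], dim=1))
```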
- Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety.
This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- End-To-End Clinical Trial Matching with Large Language Models [0.6151041580858937]
We present an end-to-end pipeline for clinical trial matching using Large Language Models (LLMs).
Our approach identifies relevant candidate trials in 93.3% of cases and achieves a preliminary accuracy of 88.0%.
Our fully end-to-end pipeline can operate autonomously or with human supervision and is not restricted to oncology.
arXiv Detail & Related papers (2024-07-18T12:36:26Z)
- TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets [57.067409211231244]
This paper presents meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecules, disease codes, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design.
We provide basic validation methods for each task to ensure the datasets' usability and reliability.
We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design.
arXiv Detail & Related papers (2024-06-30T09:13:10Z)
- Towards Conversational Diagnostic AI [32.84876349808714]
We introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions.
AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors.
arXiv Detail & Related papers (2024-01-11T04:25:06Z)
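A loose sketch of how such a self-play loop could be organized: a doctor agent converses with a simulated patient, a critic produces automated feedback, and the doctor is refined on the critiqued transcript. Every interface below (`doctor`, `patient_simulator`, `critic`) is hypothetical and does not reflect AMIE's actual implementation.

```python
def self_play_consultation(doctor, patient_simulator, critic, condition, max_turns=30):
    """One simulated consultation followed by automated feedback (interfaces are hypothetical)."""
    dialogue = []
    utterance = doctor.greet()
    for _ in range(max_turns):
        reply = patient_simulator.respond(condition, utterance)
        dialogue.append((utterance, reply))
        if doctor.ready_to_conclude(dialogue):
            break
        utterance = doctor.next_turn(dialogue)
    feedback = critic.evaluate(dialogue, condition)  # e.g. diagnosis correctness, history coverage, empathy
    doctor.learn_from(dialogue, feedback)            # refine the dialogue agent on the critiqued transcript
    return feedback

def self_play_training(doctor, patient_simulator, critic, conditions, rounds_per_condition=4):
    """Scale learning across many disease conditions by repeating the inner loop."""
    for condition in conditions:
        for _ in range(rounds_per_condition):
            self_play_consultation(doctor, patient_simulator, critic, condition)
```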
"Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image (MedAI 2021)" competitions.
We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translations of such methods into the clinic.
arXiv Detail & Related papers (2023-07-30T16:08:45Z)
- Exploring linguistic feature and model combination for speech recognition based automatic AD detection [61.91708957996086]
Speech based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques.
Scarcity of specialist data leads to uncertainty in both model selection and feature learning when developing such systems.
This paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and RoBERTa pre-trained text encoders.
arXiv Detail & Related papers (2022-06-28T05:09:01Z)
- Detecting Spurious Correlations with Sanity Tests for Artificial Intelligence Guided Radiology Systems [22.249702822013045]
A critical component to deploying AI in radiology is to gain confidence in a developed system's efficacy and safety.
The current gold standard approach is to conduct an analytical validation of performance on a generalization dataset.
We describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons.
arXiv Detail & Related papers (2021-03-04T14:14:05Z)
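One plausible instance of such a sanity test, sketched below: compare accuracy on intact images against images whose clinically relevant region has been masked out; if accuracy barely drops, the system is probably relying on spurious cues. The `predict` callable, the mask source, and the tolerance threshold are illustrative assumptions, not necessarily the tests proposed in the paper.

```python
import numpy as np

def masked_region_sanity_test(predict, images, labels, masks, tolerance=0.05):
    """Flag a model that performs nearly as well when the pathology is hidden.

    predict: callable mapping an image batch to predicted labels
    masks:   boolean array, same shape as images, marking the clinically relevant region
    """
    baseline_acc = np.mean(predict(images) == labels)
    occluded = images.copy()
    occluded[masks] = 0                     # blank out the region the model *should* rely on
    occluded_acc = np.mean(predict(occluded) == labels)
    return {
        "baseline_accuracy": float(baseline_acc),
        "occluded_accuracy": float(occluded_acc),
        "suspicious": bool(baseline_acc - occluded_acc < tolerance),
    }
```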
- Identification of Ischemic Heart Disease by using machine learning technique based on parameters measuring Heart Rate Variability [50.591267188664666]
In this study, 18 non-invasive features (age, gender, left ventricular ejection fraction, and 15 obtained from HRV) of 243 subjects were used to train and validate a series of ANNs.
The best result was obtained using 7 input parameters and 7 hidden nodes, with accuracies of 98.9% and 82% on the training and validation datasets, respectively.
arXiv Detail & Related papers (2020-10-29T19:14:41Z)
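The reported configuration (7 selected inputs, a single hidden layer of 7 nodes) is small enough to sketch directly. The snippet below uses scikit-learn with random placeholder data, since the actual HRV features and cohort are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for 7 selected features (e.g. age, gender, LVEF, HRV measures)
# of 243 subjects; the study itself used clinically measured values.
rng = np.random.default_rng(0)
X = rng.normal(size=(243, 7))
y = rng.integers(0, 2, size=243)            # 1 = ischemic heart disease, 0 = control

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# One hidden layer of 7 nodes, mirroring the best configuration reported in the abstract.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(7,), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)
print("training accuracy:  ", model.score(X_train, y_train))
print("validation accuracy:", model.score(X_val, y_val))
```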