Evaluation of AI Chatbots for Patient-Specific EHR Questions
- URL: http://arxiv.org/abs/2306.02549v1
- Date: Mon, 5 Jun 2023 02:52:54 GMT
- Title: Evaluation of AI Chatbots for Patient-Specific EHR Questions
- Authors: Alaleh Hamidi and Kirk Roberts
- Abstract summary: We use several large language model (LLM) based systems: ChatGPT (versions 3.5 and 4), Google Bard, and Claude.
We evaluate the accuracy, relevance, comprehensiveness, and coherence of the answers generated by each model using a 5-point Likert scale on a set of patient-specific questions.
- Score: 5.195779994399724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the use of artificial intelligence chatbots for
patient-specific question answering (QA) from clinical notes using several
large language model (LLM) based systems: ChatGPT (versions 3.5 and 4), Google
Bard, and Claude. We evaluate the accuracy, relevance, comprehensiveness, and
coherence of the answers generated by each model using a 5-point Likert scale
on a set of patient-specific questions.
Related papers
- Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization [8.450904497835262]
Current gold standard for evaluating AI responses is labor-intensive and slow. We conducted a large systematic study of evaluation approaches. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems.
arXiv Detail & Related papers (2025-10-01T02:39:37Z) - MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian. We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z) - A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization [15.837772594006038]
ArchEHR-QA is an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings. Cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers. The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores.
arXiv Detail & Related papers (2025-06-04T16:55:08Z) - Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation [2.7379431425414693]
This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems.
arXiv Detail & Related papers (2024-09-03T14:38:29Z) - LLM Questionnaire Completion for Automatic Psychiatric Assessment [49.1574468325115]
We employ a Large Language Model (LLM) to convert unstructured psychological interviews into structured questionnaires spanning various psychiatric and personality domains.
The obtained answers are coded as features, which are used to predict standardized psychiatric measures of depression (PHQ-8) and PTSD (PCL-C).
arXiv Detail & Related papers (2024-06-09T09:03:11Z) - Quriosity: Analyzing Human Questioning Behavior and Causal Inquiry through Curiosity-Driven Queries [91.70689724416698]
We present Quriosity, a collection of 13.5K naturally occurring questions from three diverse sources.
Our analysis reveals a significant presence of causal questions (up to 42%) in the dataset.
arXiv Detail & Related papers (2024-05-30T17:55:28Z) - Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems [0.7124736158080937]
We develop an evaluation paradigm that centers human understanding and decision-making.
We study the utility of generative AI systems in supporting people in a concrete task.
We evaluate two state-of-the-art generative AI systems against the radiologist's responses.
arXiv Detail & Related papers (2024-01-31T23:24:37Z) - K-QA: A Real-World Medical Q&A Benchmark [12.636564634626422]
We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health.
We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements.
We evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes.
arXiv Detail & Related papers (2024-01-25T20:11:04Z) - Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z) - A General-purpose AI Avatar in Healthcare [1.5081825869395544]
This paper focuses on the role of chatbots in healthcare and explores the use of avatars to make AI interactions more appealing to patients.
A framework of a general-purpose AI avatar application is demonstrated by using a three-category prompt dictionary and prompt improvement mechanism.
A two-phase approach is suggested to fine-tune a general-purpose AI language model and create different AI avatars to discuss medical issues with users.
arXiv Detail & Related papers (2024-01-10T03:44:15Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z) - Knowledge Grounded Conversational Symptom Detection with Graph Memory Networks [5.788153402669881]
We build a system that can interact with patients through dialog to detect and collect clinical symptoms automatically.
Given a set of explicit symptoms provided by the patient to initiate a dialog for diagnosing, the system is trained to collect implicit symptoms by asking questions.
After getting the reply from the patient for each question, the system also decides whether current information is enough for a human doctor to make a diagnosis.
arXiv Detail & Related papers (2021-01-24T18:50:16Z) - Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data [83.89578557287658]
We propose a novel multi-channel deep convolutional neural network architecture, namely Quest-CNN, for the purpose of separating real questions.
We conducted a comprehensive performance comparison analysis of the proposed network against other deep neural networks.
The proposed Quest-CNN achieved the best F1 score both on a dataset of data entry-review dialogue in a dialysis care setting, and on a general domain dataset.
arXiv Detail & Related papers (2020-10-15T15:11:22Z) - Investigation of Sentiment Controllable Chatbot [50.34061353512263]
In this paper, we investigate four models to scale or adjust the sentiment of the response.
The models are a persona-based model, reinforcement learning, a plug and play model, and CycleGAN.
We develop machine-evaluated metrics to estimate whether the responses are reasonable given the input.
arXiv Detail & Related papers (2020-07-11T16:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.