Challenges of GPT-3-based Conversational Agents for Healthcare
- URL: http://arxiv.org/abs/2308.14641v2
- Date: Tue, 29 Aug 2023 15:48:23 GMT
- Title: Challenges of GPT-3-based Conversational Agents for Healthcare
- Authors: Fabian Lechner and Allison Lahnala and Charles Welch and Lucie Flek
- Abstract summary: This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA)
We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems.
Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.
- Score: 11.517862889784293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The potential to provide patients with faster information access while
allowing medical specialists to concentrate on critical tasks makes medical
domain dialog agents appealing. However, the integration of large-language
models (LLMs) into these agents presents certain limitations that may result in
serious consequences. This paper investigates the challenges and risks of using
GPT-3-based models for medical question-answering (MedQA). We perform several
evaluations contextualized in terms of standard medical principles. We provide
a procedure for manually designing patient queries to stress-test high-risk
limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to
respond adequately to these queries, generating erroneous medical information,
unsafe recommendations, and content that may be considered offensive.
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters [1.6135243915480502]
Large language models (LLMs) offer solutions by simplifying medical information.
evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources.
MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline.
arXiv Detail & Related papers (2025-02-05T15:56:37Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain [21.96129653695565]
Large Language Models (LLMs) can assist and potentially correct physicians in medical decision-making tasks.
We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios.
arXiv Detail & Related papers (2024-03-29T16:59:13Z) - A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities.
Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - ChatFDA: Medical Records Risk Assessment [0.0]
This study explores a pioneering application aimed at addressing this challenge by assisting caregivers in gauging potential risks derived from medical notes.
The application leverages data from openFDA, delivering real-time, actionable insights regarding prescriptions.
Preliminary analyses conducted on the MIMIC-III citemimic dataset affirm a proof of concept highlighting a reduction in medical errors and an amplification in patient safety.
arXiv Detail & Related papers (2023-12-20T03:40:45Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Appraising the Potential Uses and Harms of LLMs for Medical Systematic
Reviews [21.546144601311187]
Large language models (LLMs) offer potential to automatically generate literature reviews on demand.
LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission.
arXiv Detail & Related papers (2023-05-19T17:09:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.