Related papers: Challenges of GPT-3-based Conversational Agents for Healthcare

URL: http://arxiv.org/abs/2308.14641v2
Date: Tue, 29 Aug 2023 15:48:23 GMT
Title: Challenges of GPT-3-based Conversational Agents for Healthcare
Authors: Fabian Lechner and Allison Lahnala and Charles Welch and Lucie Flek
Abstract summary: This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA). We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.
Score: 11.517862889784293
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The potential to provide patients with faster information access while allowing medical specialists to concentrate on critical tasks makes medical domain dialog agents appealing. However, the integration of large-language models (LLMs) into these agents presents certain limitations that may result in serious consequences. This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA). We perform several evaluations contextualized in terms of standard medical principles. We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.

Related papers

Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions [16.21971764311474]
We evaluate large language models (LLMs) on cancer-related questions drawn from real patients. LLMs frequently fail to recognize or address false presuppositions in the questions. These findings expose a critical gap in the clinical reliability of LLMs.
arXiv Detail & Related papers (2025-04-15T16:37:32Z)
TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews [54.35097932763878]
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness.
arXiv Detail & Related papers (2025-03-26T15:58:16Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters [1.6135243915480502]
Large language models (LLMs) offer solutions by simplifying medical information. Evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline.
arXiv Detail & Related papers (2025-02-05T15:56:37Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs). We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules. We develop a medical dialogue dataset comprising rule-based communications between patients and physicians. Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z)
Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions [42.73799041840482]
i-MedRAG is a system that iteratively asks follow-up queries based on previous information-seeking attempts. Our zero-shot i-MedRAG outperforms all existing prompt engineering and fine-tuning methods on GPT-3.5. i-MedRAG can flexibly ask follow-up queries to form reasoning chains, providing an in-depth analysis of medical questions.
arXiv Detail & Related papers (2024-08-01T17:18:17Z)
Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain [21.96129653695565]
Large Language Models (LLMs) can assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios.
arXiv Detail & Related papers (2024-03-29T16:59:13Z)
Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs). Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities. We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
ChatFDA: Medical Records Risk Assessment [0.0]
This study explores a pioneering application aimed at addressing this challenge by assisting caregivers in gauging potential risks derived from medical notes. The application leverages data from openFDA, delivering real-time, actionable insights regarding prescriptions. Preliminary analyses conducted on the MIMIC-III dataset affirm a proof of concept highlighting a reduction in medical errors and an amplification in patient safety.
arXiv Detail & Related papers (2023-12-20T03:40:45Z)
ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain. ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF. We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z)
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency. Evaluating LLMs on realistic text generation tasks for healthcare remains challenging. We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews [21.546144601311187]
Large language models (LLMs) offer potential to automatically generate literature reviews on demand. LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission.
arXiv Detail & Related papers (2023-05-19T17:09:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.