Large language models provide unsafe answers to patient-posed medical questions
- URL: http://arxiv.org/abs/2507.18905v2
- Date: Mon, 04 Aug 2025 21:31:09 GMT
- Title: Large language models provide unsafe answers to patient-posed medical questions
- Authors: Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah,
- Abstract summary: We compare the safety of four publicly available chatbots--Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta--on a new dataset, HealthAdvice. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama).
- Score: 0.12568469427065204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots--Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta--on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women's health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.
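The abstract reports statistically significant differences between chatbots without naming the test used. As a minimal sketch of the kind of quantitative comparison it describes, the snippet below assumes a chi-square test of independence over counts back-calculated from the reported rates for Claude (21.6% of 222) and Llama (43.2% of 222); the counts and the choice of test are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (not the authors' code): comparing per-chatbot rates of
# problematic responses with a chi-square test of independence.
# Counts are back-calculated from the reported percentages and are assumptions.
from scipy.stats import chi2_contingency

N_QUESTIONS = 222  # patient-posed advice-seeking questions evaluated per chatbot

problematic = {
    "Claude": round(0.216 * N_QUESTIONS),      # ~48 problematic responses
    "Llama3-70B": round(0.432 * N_QUESTIONS),  # ~96 problematic responses
}

# Contingency table: one row per chatbot, columns = (problematic, acceptable)
table = [[count, N_QUESTIONS - count] for count in problematic.values()]

chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")
```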
Related papers
- Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [51.73411055162861]
We introduce a safety evaluation protocol tailored to the medical domain from both patient user and clinician user perspectives.
This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming from three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z)
- Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening [48.355615275247786]
HopeBot administers the Patient Health Questionnaire-9 (PHQ-9) using retrieval-augmented generation and real-time clarification.
In a within-subject study, 132 adults in the United Kingdom and China completed both the self-administered and chatbot versions.
Overall, 87.1% expressed willingness to reuse or recommend HopeBot.
arXiv Detail & Related papers (2025-07-08T13:41:22Z)
- Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions [16.21971764311474]
We evaluate large language models (LLMs) on cancer-related questions drawn from real patients.
LLMs frequently fail to recognize or address false presuppositions in the questions.
These findings expose a critical gap in the clinical reliability of LLMs.
arXiv Detail & Related papers (2025-04-15T16:37:32Z)
- Clean & Clear: Feasibility of Safe LLM Clinical Guidance [2.0194749607835014]
Clinical guidelines are central to safe evidence-based medicine in modern healthcare.
We developed an open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions.
73% of its responses were rated as very relevant, showcasing a strong understanding of the clinical context.
arXiv Detail & Related papers (2025-03-26T19:36:43Z)
- Conversational Medical AI: Ready for Practice [0.19791587637442667]
We present the first large-scale evaluation of a physician-supervised conversational agent in a real-world medical setting.
Our agent, Mo, was integrated into an existing medical advice chat service.
arXiv Detail & Related papers (2024-11-19T19:00:31Z)
- Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions [9.327472312657392]
The integration of Large Language Models (LLMs) into the healthcare domain has the potential to significantly enhance patient care and support.
This study investigates whether ChatGPT can respond with a greater degree of empathy than physicians typically offer.
We collect a de-identified dataset of patient messages and physician responses from Mayo Clinic and generate alternative replies using ChatGPT.
arXiv Detail & Related papers (2024-05-26T01:58:57Z)
- How Reliable AI Chatbots are for Disease Prediction from Patient Complaints? [0.0]
This study examines the reliability of AI chatbots, specifically GPT 4.0, Claude 3 Opus, and Gemini Ultra 1.0, in predicting diseases from patient complaints in the emergency department.
Results suggest that GPT 4.0 achieves high accuracy with increased few-shot data, while Gemini Ultra 1.0 performs well with fewer examples, and Claude 3 Opus maintains consistent performance.
arXiv Detail & Related papers (2024-05-21T22:00:13Z)
- Healthcare Copilot: Eliciting the Power of General LLMs for Medical Consultation [96.22329536480976]
We introduce the construction of a Healthcare Copilot designed for medical consultation.
The proposed Healthcare Copilot comprises three main components: 1) the Dialogue component, responsible for effective and safe patient interactions; 2) the Memory component, storing both current conversation data and historical patient information; and 3) the Processing component, summarizing the entire dialogue and generating reports.
To evaluate the proposed Healthcare Copilot, we implement an auto-evaluation scheme using ChatGPT for two roles: as a virtual patient engaging in dialogue with the copilot, and as an evaluator to assess the quality of the dialogue.
arXiv Detail & Related papers (2024-02-20T22:26:35Z)
- Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study [46.5728291706842]
We developed a patient-facing tool using large language models (LLMs) to make clinical notes more readable.
We piloted the tool with clinical notes donated by patients with a history of breast cancer and synthetic notes from a clinician.
arXiv Detail & Related papers (2024-01-17T23:14:52Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
GPT-4V excels at understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore (see the sketch after this entry).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
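As referenced in the last related paper above, a character-based Levenshtein distance can serve as a simple surface-level metric for comparing automatically generated consultation notes against clinician-written references. The following is a minimal illustrative sketch, not the paper's implementation.

```python
# Minimal character-level Levenshtein (edit) distance, the kind of simple
# surface metric the study above found competitive with model-based metrics
# such as BertScore. Illustrative only; not the paper's implementation.
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(reference: str, generated: str) -> float:
    """Normalise the distance to a similarity score in [0, 1]."""
    if not reference and not generated:
        return 1.0
    return 1.0 - levenshtein(reference, generated) / max(len(reference), len(generated))

print(levenshtein_similarity("Patient reports chest pain.", "Patient reports chest pains."))
```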