Related papers: Language models are susceptible to incorrect patient self-diagnosis in medical applications

Language models are susceptible to incorrect patient self-diagnosis in medical applications

URL: http://arxiv.org/abs/2309.09362v1
Date: Sun, 17 Sep 2023 19:56:39 GMT
Title: Language models are susceptible to incorrect patient self-diagnosis in medical applications
Authors: Rojin Ziaei and Samuel Schmidgall
Abstract summary: We present a variety of LLMs with multiple-choice questions from U.S. medical board exams modified to include self-diagnostic reports from patients. Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drop dramatically.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are becoming increasingly relevant as a potential tool for healthcare, aiding communication between clinicians, researchers, and patients. However, traditional evaluations of LLMs on medical exam questions do not reflect the complexity of real patient-doctor interactions. An example of this complexity is the introduction of patient self-diagnosis, where a patient attempts to diagnose their own medical conditions from various sources. While the patient sometimes arrives at an accurate conclusion, they more often are led toward misdiagnosis due to the patient's over-emphasis on bias validating information. In this work we present a variety of LLMs with multiple-choice questions from United States medical board exams which are modified to include self-diagnostic reports from patients. Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drop dramatically, revealing a high susceptibility to errors in self-diagnosis.

Related papers

Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis [0.7430974817507225]
We develop an LLM-based dialogue system, namely proactive multi-round vision-language interactions for computer-aided diagnosis (ProMRVL-CAD) The proposed ProMRVL-CAD system allows proactive dialogue to provide patients with constant and reliable medical access via an integration of knowledge graph into a recommendation system.
arXiv Detail & Related papers (2025-02-15T01:14:23Z)
Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning [5.520419627866446]
We introduce Ask Patients with Patience (APP), the first multi-turn dialogue that enables LLMs to iteratively refine diagnoses based on grounded reasoning. APP achieves higher similarity scores in diagnosis predictions, demonstrating better alignment with ground truth diagnoses. APP also excels in user accessibility and empathy, further bridging the gap between complex medical language and user understanding.
arXiv Detail & Related papers (2025-02-11T00:13:52Z)
DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models [2.750784330885499]
We introduce DiversityMedQA, a novel benchmark designed to assess large language models (LLMs) responses to medical queries across diverse patient demographics. Our findings reveal notable discrepancies in model performance when tested against these demographic variations.
arXiv Detail & Related papers (2024-09-02T23:37:20Z)
RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules. We develop a medical dialogue dataset comprising rule-based communications between patients and physicians. Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z)
Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases. We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z)
Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses [0.2995925627097048]
This study evaluates each model diagnostic abilities by interpreting a user symptoms and determining diagnoses that fit well with common illnesses. GPT-4 demonstrates higher diagnostic accuracy from its deep and complete history of training on medical data. Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model.
arXiv Detail & Related papers (2024-05-09T15:12:24Z)
Conversational Disease Diagnosis via External Planner-Controlled Large Language Models [18.93345199841588]
This study presents a LLM-based diagnostic system that enhances planning capabilities by emulating doctors. By utilizing real patient electronic medical record data, we constructed simulated dialogues between virtual patients and doctors.
arXiv Detail & Related papers (2024-04-04T06:16:35Z)
Towards Reducing Diagnostic Errors with Interpretable Risk Prediction [18.474645862061426]
We propose a method to use LLMs to identify pieces of evidence in patient EHR data that indicate increased or decreased risk of specific diagnoses. Our ultimate aim is to increase access to evidence and reduce diagnostic errors.
arXiv Detail & Related papers (2024-02-15T17:05:48Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
Self-Diagnosis and Large Language Models: A New Front for Medical Misinformation [8.738092015092207]
We evaluate the capabilities of large language models (LLMs) from the lens of a general user self-diagnosing. We develop a testing methodology which can be used to evaluate responses to open-ended questions mimicking real-world use cases. We reveal that a) these models perform worse than previously known, and b) they exhibit peculiar behaviours, including overconfidence when stating incorrect recommendations.
arXiv Detail & Related papers (2023-07-10T21:28:26Z)
SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization. Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)
Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks. Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets. We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)
Towards Causality-Aware Inferring: A Sequential Discriminative Approach for Medical Diagnosis [142.90770786804507]
Medical diagnosis assistant (MDA) aims to build an interactive diagnostic agent to sequentially inquire about symptoms for discriminating diseases. This work attempts to address these critical issues in MDA by taking advantage of the causal diagram. We propose a propensity-based patient simulator to effectively answer unrecorded inquiry by drawing knowledge from the other records.
arXiv Detail & Related papers (2020-03-14T02:05:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.