Language models are susceptible to incorrect patient self-diagnosis in
medical applications
- URL: http://arxiv.org/abs/2309.09362v1
- Date: Sun, 17 Sep 2023 19:56:39 GMT
- Title: Language models are susceptible to incorrect patient self-diagnosis in
medical applications
- Authors: Rojin Ziaei and Samuel Schmidgall
- Abstract summary: We present a variety of LLMs with multiple-choice questions from U.S. medical board exams modified to include self-diagnostic reports from patients.
Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drop dramatically.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are becoming increasingly relevant as a
potential tool for healthcare, aiding communication between clinicians,
researchers, and patients. However, traditional evaluations of LLMs on medical
exam questions do not reflect the complexity of real patient-doctor
interactions. An example of this complexity is the introduction of patient
self-diagnosis, where a patient attempts to diagnose their own medical
conditions from various sources. While the patient sometimes arrives at an
accurate conclusion, they more often are led toward misdiagnosis due to the
patient's over-emphasis on bias validating information. In this work we present
a variety of LLMs with multiple-choice questions from United States medical
board exams which are modified to include self-diagnostic reports from
patients. Our findings highlight that when a patient proposes incorrect
bias-validating information, the diagnostic accuracy of LLMs drop dramatically,
revealing a high susceptibility to errors in self-diagnosis.
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning [5.520419627866446]
We introduce Ask Patients with Patience (APP), the first multi-turn dialogue that enables LLMs to iteratively refine diagnoses based on grounded reasoning.
APP achieves higher similarity scores in diagnosis predictions, demonstrating better alignment with ground truth diagnoses.
APP also excels in user accessibility and empathy, further bridging the gap between complex medical language and user understanding.
arXiv Detail & Related papers (2025-02-11T00:13:52Z) - Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators [5.217925404425509]
We conduct experiments to explore the relationship between "inquiry" and "diagnosis" in the consultation process.
We categorize the inquiry process into four types: (1) chief complaint inquiry; (2) specification of known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering family or medical history.
arXiv Detail & Related papers (2025-01-16T11:41:14Z) - DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models [2.750784330885499]
We introduce DiversityMedQA, a novel benchmark designed to assess large language models (LLMs) responses to medical queries across diverse patient demographics.
Our findings reveal notable discrepancies in model performance when tested against these demographic variations.
arXiv Detail & Related papers (2024-09-02T23:37:20Z) - RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules.
We develop a medical dialogue dataset comprising rule-based communications between patients and physicians.
Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z) - Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases.
We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases.
We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models.
Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z) - Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses [0.2995925627097048]
This study evaluates each model diagnostic abilities by interpreting a user symptoms and determining diagnoses that fit well with common illnesses.
GPT-4 demonstrates higher diagnostic accuracy from its deep and complete history of training on medical data.
Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model.
arXiv Detail & Related papers (2024-05-09T15:12:24Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Self-Diagnosis and Large Language Models: A New Front for Medical
Misinformation [8.738092015092207]
We evaluate the capabilities of large language models (LLMs) from the lens of a general user self-diagnosing.
We develop a testing methodology which can be used to evaluate responses to open-ended questions mimicking real-world use cases.
We reveal that a) these models perform worse than previously known, and b) they exhibit peculiar behaviours, including overconfidence when stating incorrect recommendations.
arXiv Detail & Related papers (2023-07-10T21:28:26Z) - SPeC: A Soft Prompt-Based Calibration on Performance Variability of
Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z) - Towards Causality-Aware Inferring: A Sequential Discriminative Approach
for Medical Diagnosis [142.90770786804507]
Medical diagnosis assistant (MDA) aims to build an interactive diagnostic agent to sequentially inquire about symptoms for discriminating diseases.
This work attempts to address these critical issues in MDA by taking advantage of the causal diagram.
We propose a propensity-based patient simulator to effectively answer unrecorded inquiry by drawing knowledge from the other records.
arXiv Detail & Related papers (2020-03-14T02:05:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.