The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness
- URL: http://arxiv.org/abs/2503.10647v1
- Date: Sun, 02 Mar 2025 11:50:16 GMT
- Title: The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness
- Authors: Krishna Subedi
- Abstract summary: Large Language Models (LLMs) offer promise for democratizing healthcare with advanced diagnostics. This study assesses their diagnostic reliability focusing on consistency, manipulation resilience, and contextual integration. LLMs' vulnerability to manipulation and limited contextual awareness pose challenges in clinical use.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Universal healthcare access is critically needed, especially in resource-limited settings. Large Language Models (LLMs) offer promise for democratizing healthcare with advanced diagnostics, but their reliability requires thorough evaluation, especially in trust-dependent environments. This study assesses LLMs' diagnostic reliability focusing on consistency, manipulation resilience, and contextual integration, crucial for safe and ethical use in universal healthcare. We evaluated leading LLMs using 52 patient cases, expanded into variants with demographic changes, symptom rewordings, and exam modifications, while keeping core diagnoses constant. Manipulation susceptibility was tested by inserting misleading narratives and irrelevant details. Contextual awareness was evaluated by comparing diagnoses with and without patient history. We analyzed diagnostic change rates and response patterns across manipulations. LLMs showed perfect diagnostic consistency for identical data but significant manipulation susceptibility. Gemini had a 40% diagnosis change rate and ChatGPT 30% with irrelevant details. ChatGPT had a higher context influence rate (77.8% vs. Gemini's 55.6%), but both showed limited nuanced contextual integration, exhibiting anchoring bias by prioritizing salient data over context. LLMs' vulnerability to manipulation and limited contextual awareness pose challenges in clinical use. Unlike clinicians, they may overstate diagnostic certainty without validation. Safeguards and domain-specific designs are crucial for reliable healthcare applications. Broad clinical use without oversight is premature and risky. LLMs can enhance diagnostics with responsible use, but future research is needed to improve manipulation resistance and contextual understanding for safe healthcare democratization.
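The diagnostic change rate reported above (e.g. Gemini's 40% under irrelevant-detail insertion) reduces to comparing each variant's diagnosis against the unmodified baseline case. The following is a minimal Python sketch of that computation under assumed data structures; the class, field, and function names are illustrative and do not come from the paper.

```python
# Minimal sketch of the diagnostic change-rate analysis described in the abstract.
# The data structures and field names below are illustrative assumptions,
# not the authors' actual evaluation pipeline.
from dataclasses import dataclass


@dataclass
class Variant:
    case_id: str     # which of the 52 base cases this variant was derived from
    kind: str        # e.g. "demographic", "rewording", "misleading", "irrelevant"
    diagnosis: str   # diagnosis the LLM returned for this variant


def change_rate(baseline: dict[str, str], variants: list[Variant], kind: str) -> float:
    """Fraction of variants of a given kind whose diagnosis differs from the
    baseline diagnosis of the same case."""
    relevant = [v for v in variants if v.kind == kind]
    if not relevant:
        return 0.0
    changed = sum(1 for v in relevant if v.diagnosis != baseline[v.case_id])
    return changed / len(relevant)


# Hypothetical example: 2 of 5 irrelevant-detail variants flip the diagnosis,
# i.e. a 40% change rate (the figure the abstract reports for Gemini).
baseline = {"case01": "appendicitis", "case02": "migraine"}
variants = [
    Variant("case01", "irrelevant", "appendicitis"),
    Variant("case01", "irrelevant", "gastroenteritis"),
    Variant("case02", "irrelevant", "migraine"),
    Variant("case02", "irrelevant", "tension headache"),
    Variant("case02", "irrelevant", "migraine"),
]
print(f"diagnosis change rate: {change_rate(baseline, variants, 'irrelevant'):.0%}")
```

The context influence rate (diagnoses with vs. without patient history) could be computed the same way, treating the history-free diagnosis of each case as the baseline.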
Related papers
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
- LLM Sensitivity Evaluation Framework for Clinical Diagnosis [10.448772462311027]
Large language models (LLMs) have demonstrated impressive performance across various domains.
Clinical diagnosis, however, places higher demands on LLMs' reliability and sensitivity.
arXiv Detail & Related papers (2025-04-18T05:35:11Z)
- Leveraging LLMs for Predicting Unknown Diagnoses from Clinical Notes [21.43498764977656]
Discharge summaries tend to provide more complete information, which can help infer accurate diagnoses.
This study investigates whether large language models (LLMs) can predict implicitly mentioned diagnoses from clinical notes and link them to corresponding medications.
arXiv Detail & Related papers (2025-03-28T02:15:57Z)
- Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1 [0.0]
This study assesses DeepSeek R1's medical reasoning against expert patterns using 100 MedQA clinical cases.
The model achieved 93% diagnostic accuracy, demonstrating systematic clinical judgment through differential diagnosis, guideline-based treatment selection, and integration of patient-specific factors.
Error analysis of seven incorrect cases revealed persistent limitations: anchoring bias, challenges reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment over intermediate care.
arXiv Detail & Related papers (2025-03-27T09:18:08Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
- Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety.
This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z)
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models [92.04812189642418]
We introduce CARES, which aims to evaluate the trustworthiness of Med-LVLMs across the medical domain.
We assess the trustworthiness of Med-LVLMs across five dimensions, including trustfulness, fairness, safety, privacy, and robustness.
arXiv Detail & Related papers (2024-06-10T04:07:09Z)
- MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning [36.400896909161006]
We develop systems that proactively ask questions to gather more information and respond reliably.
We introduce a benchmark - MediQ - to evaluate question-asking ability in LLMs.
arXiv Detail & Related papers (2024-06-03T01:32:52Z)
- Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses [0.2995925627097048]
This study evaluates each model's diagnostic abilities by interpreting a user's symptoms and determining diagnoses that fit well with common illnesses.
GPT-4 demonstrates higher diagnostic accuracy, owing to its extensive training on medical data.
Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model.
arXiv Detail & Related papers (2024-05-09T15:12:24Z)
- Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias [5.421033429862095]
Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes.
This study explores the role of large language models in mitigating these biases through the utilization of a multi-agent framework.
arXiv Detail & Related papers (2024-01-26T01:35:50Z)
- Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models [4.8775268199830935]
This study aims to assess the effectiveness of large language models (LLMs) as a self-diagnostic tool and their role in spreading healthcare misinformation.
We use open-ended questions to mimic real-world self-diagnosis use cases, and perform sentence dropout to mimic realistic self-diagnosis with missing information.
The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate.
arXiv Detail & Related papers (2023-07-10T21:28:26Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
arXiv Detail & Related papers (2023-01-01T05:02:46Z)