Medical Hallucinations in Foundation Models and Their Impact on Healthcare
- URL: http://arxiv.org/abs/2503.05777v2
- Date: Sun, 02 Nov 2025 11:26:10 GMT
- Title: Medical Hallucinations in Foundation Models and Their Impact on Healthcare
- Authors: Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Chunjong Park, Hyeonhoon Lee, Hae Won Park, Daniel McDuff, Samir Tulebaev, Cynthia Breazeal,
- Abstract summary: Hallucinations in foundation models arise from autoregressive training objectives. Top-performing models exceeded 97% accuracy when augmented with chain-of-thought prompting.
- Score: 71.15392179084428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. We define medical hallucination as any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions. We evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks spanning medical reasoning and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U = 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought prompting (base: 87.6%), while medical-specialized models like MedGemma ranged from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought reasoning significantly reduced hallucinations in 86.4% of tested comparisons after FDR correction (q < 0.05), demonstrating that explicit reasoning traces enable self-verification and error detection. Physician audits confirmed that 64-72% of residual hallucinations stemmed from causal or temporal reasoning failures rather than knowledge gaps. A global survey of clinicians (n = 70) validated real-world impact: 91.8% had encountered medical hallucinations, and 84.7% considered them capable of causing patient harm. The underperformance of medical-specialized models despite domain training indicates that safety emerges from sophisticated reasoning capabilities and broad knowledge integration developed during large-scale pre-training, not from narrow optimization.
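The abstract reports its group comparison with a Mann-Whitney U test, a rank-biserial effect size, and Benjamini-Hochberg FDR correction across the chain-of-thought comparisons. As a minimal sketch of how such statistics can be computed with SciPy and statsmodels, the snippet below uses illustrative placeholder rates and p-values, not the paper's actual per-model data.

```python
# Hedged sketch: the numeric inputs are illustrative placeholders, not the paper's results.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Hypothetical hallucination-free response rates (%) per model
general_purpose = np.array([87.6, 76.6, 80.2, 74.1, 78.9, 72.3, 81.5])  # 7 models (illustrative)
medical_specialized = np.array([51.3, 28.6, 61.9, 55.0])                # 4 models (illustrative)

# Two-sided Mann-Whitney U test comparing the two groups
u_stat, p_value = mannwhitneyu(general_purpose, medical_specialized, alternative="two-sided")

# Rank-biserial correlation as the effect size: r = 1 - 2U / (n1 * n2)
n1, n2 = len(general_purpose), len(medical_specialized)
rank_biserial = 1 - 2 * u_stat / (n1 * n2)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}, rank-biserial r = {rank_biserial:.2f}")

# Benjamini-Hochberg FDR correction over a set of per-task CoT-vs-base p-values (illustrative)
per_task_p_values = [0.001, 0.03, 0.04, 0.20, 0.008]
rejected, q_values, _, _ = multipletests(per_task_p_values, alpha=0.05, method="fdr_bh")
print("q-values:", np.round(q_values, 3), "significant:", rejected)
```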
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z) - Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT [8.050646314390763]
Translating foundation models into clinical practice requires evaluating their performance under compound distribution shift. We investigated whether specificity deficits in foundation models are associated with heterogeneity in the negative class.
arXiv Detail & Related papers (2026-02-10T23:08:06Z) - Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation [5.469454486414467]
Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery. LLMs pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment.
arXiv Detail & Related papers (2025-11-01T15:25:55Z) - Predicting Metabolic Dysfunction-Associated Steatotic Liver Disease using Machine Learning Methods [0.8642326601683298]
We developed a fair, rigorous, and reproducible MASLD prediction model. MASLD affects 33% of U.S. adults and is the most common chronic liver disease.
arXiv Detail & Related papers (2025-10-25T13:36:18Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce MedOmni-45°, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
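The "diagonal" here is the 45-degree line in a safety-versus-performance plot: a model above it would be safer than it is accurate. A small illustrative plot of that view is sketched below; the model names and scores are invented placeholders, not the benchmark's results.

```python
# Hedged sketch of the 45-degree safety-performance view; all data points are invented.
import matplotlib.pyplot as plt

models = {"Model A": (0.82, 0.61), "Model B": (0.74, 0.70), "Model C": (0.66, 0.58)}  # (performance, safety)

fig, ax = plt.subplots()
for name, (perf, safety) in models.items():
    ax.scatter(perf, safety)
    ax.annotate(name, (perf, safety))

# The 45-degree diagonal: points above it would be models whose safety exceeds their performance.
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")
ax.set_xlabel("Performance (accuracy)")
ax.set_ylabel("Safety (resistance to manipulative hints)")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.savefig("medomni_45_sketch.png")
```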
arXiv Detail & Related papers (2025-08-22T08:38:16Z) - Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK [1.3638020767676653]
Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases to detect hallucinations.
arXiv Detail & Related papers (2025-06-10T17:12:28Z) - Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z) - Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization [8.057050705357973]
Hallucinations in large language models (LLMs) pose significant risks to patient care and clinical decision-making. General-domain detectors struggle to detect clinical hallucinations, and performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We develop fact-based approaches that count hallucinations, offering explainability not available with existing methods.
arXiv Detail & Related papers (2025-05-31T08:04:37Z) - HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination".
This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z) - A Clinician-Friendly Platform for Ophthalmic Image Analysis Without Technical Barriers [51.45596445363302]
GlobeReady is a clinician-friendly AI platform that enables fundus disease diagnosis without retraining, fine-tuning, or the need for technical expertise. We demonstrate high accuracy across imaging modalities: 93.9-98.5% for 11 fundus diseases using color fundus photographs (CFPs) and 87.2-92.7% for 15 fundus diseases using optical coherence tomography (OCT) scans. By leveraging training-free local feature augmentation, the GlobeReady platform effectively mitigates domain shifts across centers and populations.
arXiv Detail & Related papers (2025-04-22T14:17:22Z) - MedHal: An Evaluation Dataset for Medical Hallucination Detection [2.5782420501870296]
We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts.
MedHal addresses gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning.
arXiv Detail & Related papers (2025-04-11T14:55:15Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models [82.30696225661615]
We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and medically fine-tuned UltraMedical, struggle with this binary hallucination detection task. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth.
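Bidirectional entailment clustering groups an answer with the ground truth only when each statement entails the other. A minimal sketch of that check is below, assuming an off-the-shelf NLI model from Hugging Face; the model name and decision threshold are assumptions for illustration, not MedHallu's exact configuration.

```python
# Hedged sketch of a bidirectional-entailment check with an off-the-shelf NLI model.
# Model name and threshold are assumptions, not the paper's actual setup.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Return True if the NLI model predicts ENTAILMENT with high confidence."""
    result = nli({"text": premise, "text_pair": hypothesis}, top_k=None)
    scores = {r["label"]: r["score"] for r in result}
    return scores.get("ENTAILMENT", 0.0) >= threshold

def same_cluster(answer: str, ground_truth: str) -> bool:
    """Two statements fall in the same semantic cluster only if entailment holds both ways."""
    return entails(answer, ground_truth) and entails(ground_truth, answer)

print(same_cluster(
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Type 2 diabetes is commonly treated first with metformin.",
))
```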
arXiv Detail & Related papers (2025-02-20T06:33:23Z) - Towards Reliable Medical Question Answering: Techniques and Challenges in Mitigating Hallucinations in Language Models [1.03590082373586]
This paper conducts a scoping study of existing techniques for mitigating hallucinations in knowledge-based tasks in general, and in the medical domain in particular.
Key methods covered in the paper include Retrieval-Augmented Generation (RAG)-based techniques, iterative feedback loops, supervised fine-tuning, and prompt engineering.
These techniques, while promising in general contexts, require further adaptation and optimization for the medical domain due to its unique demands for up-to-date, specialized knowledge and strict adherence to medical guidelines.
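Of the techniques listed above, retrieval-augmented generation is the most mechanical: retrieve relevant clinical text and constrain the model to answer only from it. The sketch below shows that prompt-construction step with a simple TF-IDF retriever; the guideline snippets and question are hypothetical placeholders, and the final LLM generation call is omitted.

```python
# Hedged sketch of retrieval-augmented prompting for medical QA.
# The corpus and query are hypothetical placeholders; the LLM call itself is not shown.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guideline_snippets = [
    "Metformin is recommended as first-line pharmacologic therapy for type 2 diabetes.",
    "Statins are indicated for secondary prevention after myocardial infarction.",
    "Amoxicillin is a first-line antibiotic for uncomplicated community-acquired pneumonia in adults.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus snippets most similar to the query under TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(corpus + [query])
    sims = cosine_similarity(matrix[len(corpus)], matrix[:len(corpus)]).ravel()
    return [corpus[i] for i in sims.argsort()[::-1][:k]]

def build_grounded_prompt(question: str, corpus: list[str]) -> str:
    """Constrain the model to answer only from the retrieved evidence."""
    evidence = "\n".join(f"- {s}" for s in retrieve(question, corpus))
    return (
        "Answer using only the evidence below; say 'insufficient evidence' otherwise.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )

print(build_grounded_prompt("What is the first-line drug for type 2 diabetes?", guideline_snippets))
```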
arXiv Detail & Related papers (2024-08-25T11:09:15Z) - Extrinsically-Focused Evaluation of Omissions in Medical Summarization [9.847304366680772]
Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify performance has lagged.
We propose MED-OMIT, a metric for evaluating omissions in summaries of a patient's medical record.
arXiv Detail & Related papers (2023-11-14T16:46:15Z) - Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks.
This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z) - Med-HALT: Medical Domain Hallucination Test for Large Language Models [0.0]
This research paper focuses on the challenges posed by hallucinations in large language models (LLMs).
We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations.
arXiv Detail & Related papers (2023-07-28T06:43:04Z) - DKINet: Medication Recommendation via Domain Knowledge Informed Deep Learning [12.609882335746859]
Medication recommendation is a fundamental and crucial branch of healthcare.
Previous studies have primarily focused on learning patient representation from electronic health records.
We propose a knowledge injection module that addresses the effective integration of domain knowledge with complex clinical manifestations.
arXiv Detail & Related papers (2023-05-31T07:22:15Z) - SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
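The soft-prompt idea is to prepend a small set of learnable embedding vectors to the frozen model's input rather than editing the text prompt. A minimal PyTorch sketch of that mechanism follows; the dimensions and module names are chosen for illustration and are not the SPeC paper's actual configuration.

```python
# Hedged sketch of a soft prompt: trainable embeddings prepended to frozen input embeddings.
# Dimensions and names are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, prompt_length: int = 20, embed_dim: int = 768):
        super().__init__()
        # The only trainable parameters: one embedding vector per virtual prompt token.
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        """Prepend the learned prompt to a batch of token embeddings: (B, T, D) -> (B, P+T, D)."""
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage: embed the clinical note with the frozen LM's embedding layer, prepend the soft prompt,
# and feed the result through the LM via its `inputs_embeds` argument; only SoftPrompt.prompt
# would receive gradient updates during tuning.
soft_prompt = SoftPrompt()
dummy_note_embeds = torch.randn(2, 50, 768)   # (batch, tokens, embed_dim), placeholder input
print(soft_prompt(dummy_note_embeds).shape)   # torch.Size([2, 70, 768])
```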
arXiv Detail & Related papers (2023-03-23T04:47:46Z) - Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine [68.7814360102644]
We propose the Re³Writer method with retrieval-augmented generation and knowledge-grounded reasoning.
We demonstrate the effectiveness of our method in generating patient discharge instructions.
arXiv Detail & Related papers (2022-10-23T16:34:39Z) - Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.