Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK
- URL: http://arxiv.org/abs/2506.11129v1
- Date: Tue, 10 Jun 2025 17:12:28 GMT
- Title: Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK
- Authors: Carlos Garcia-Fernandez, Luis Felipe, Monique Shotande, Muntasir Zitu, Aakash Tripathi, Ghulam Rasool, Issam El Naqa, Vivek Rudrapatna, Gilmer Valdes,
- Abstract summary: Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use.<n>We present CHECK, a continuous-learning framework that integrates structured clinical databases to detect hallucinations.
- Score: 1.3638020767676653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases with a classifier grounded in information theory to detect both factual and reasoning-based hallucinations. Evaluated on 1500 questions from 100 pivotal clinical trials, CHECK reduced LLama3.3-70B-Instruct hallucination rates from 31% to 0.3% - making an open source model state of the art. Its classifier generalized across medical benchmarks, achieving AUCs of 0.95-0.96, including on the MedQA (USMLE) benchmark and HealthBench realistic multi-turn medical questioning. By leveraging hallucination probabilities to guide GPT-4o's refinement and judiciously escalate compute, CHECK boosted its USMLE passing rate by 5 percentage points, achieving a state-of-the-art 92.1%. By suppressing hallucinations below accepted clinical error thresholds, CHECK offers a scalable foundation for safe LLM deployment in medicine and other high-stakes domains.
Related papers
- Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation [0.0]
We develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model.<n>The model is capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser.
arXiv Detail & Related papers (2025-07-03T01:51:49Z) - MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models [81.64135119165277]
hallucinations can jeopardize clinical decision making, potentially harming the diagnosis and treatments.<n>We propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs.<n>We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level.
arXiv Detail & Related papers (2025-02-28T06:59:49Z) - Medical Hallucinations in Foundation Models and Their Impact on Healthcare [53.97060824532454]
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine.<n>We define medical hallucination as any instance in which a model generates misleading medical content.<n>Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates.<n>These findings underscore the ethical and practical imperative for robust detection and mitigation strategies.
arXiv Detail & Related papers (2025-02-26T02:30:44Z) - MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models [82.30696225661615]
We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection.<n>We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and medically fine-tuned UltraMedical, struggle with this binary hallucination detection task.<n>Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth.
arXiv Detail & Related papers (2025-02-20T06:33:23Z) - MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models [0.0]
Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications.<n>Their propensity for hallucinations presents substantial risks to patient care.<n>This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs.
arXiv Detail & Related papers (2024-12-25T16:51:29Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - Detecting and Evaluating Medical Hallucinations in Large Vision Language Models [22.30139330566514]
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications.
LVLMs inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts.
We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation.
We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection.
arXiv Detail & Related papers (2024-06-14T17:14:22Z) - A New Benchmark and Reverse Validation Method for Passage-level
Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z) - A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of
LLMs by Validating Low-Confidence Generation [76.34411067299331]
Large language models often tend to 'hallucinate' which critically hampers their reliability.
We propose an approach that actively detects and mitigates hallucinations during the generation process.
We show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average.
arXiv Detail & Related papers (2023-07-08T14:25:57Z) - Clinical Camel: An Open Expert-Level Medical Language Model with
Dialogue-Based Knowledge Encoding [31.884600238089405]
We present Clinical Camel, an open large language model (LLM) explicitly tailored for clinical research.
Fine-tuned from LLaMA-2 using QLoRA, Clinical Camel achieves state-of-the-art performance across medical benchmarks among openly available medical LLMs.
arXiv Detail & Related papers (2023-05-19T23:07:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.