MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
- URL: http://arxiv.org/abs/2502.14302v1
- Date: Thu, 20 Feb 2025 06:33:23 GMT
- Title: MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
- Authors: Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding
- Abstract summary: We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection.
We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task.
Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth.
- Abstract: Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.
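To make the bidirectional entailment clustering concrete, here is a minimal Python sketch: two answers fall into the same semantic cluster only if each entails the other under an NLI model. The model choice (microsoft/deberta-large-mnli) and the 0.5 score threshold are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a bidirectional entailment check between two answers,
# assuming an off-the-shelf NLI model; the paper's exact clustering
# pipeline may differ. Requires: pip install transformers torch
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def entails(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    # True if the NLI model labels premise -> hypothesis as entailment.
    result = nli([{"text": premise, "text_pair": hypothesis}])[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold

def same_cluster(answer_a: str, answer_b: str) -> bool:
    # Bidirectional entailment: cluster together only if each entails the other.
    return entails(answer_a, answer_b) and entails(answer_b, answer_a)

# A hallucinated answer can sit semantically close to the ground truth yet
# fail the mutual-entailment test:
ground_truth = "Metformin is a first-line therapy for type 2 diabetes."
hallucinated = "Metformin is the only approved therapy for type 2 diabetes."
print(same_cluster(ground_truth, hallucinated))  # typically False
```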
Related papers
- Medico: Towards Hallucination Detection and Correction with Multi-source Evidence Fusion
Hallucinations prevail in Large Language Models (LLMs), where the generated content is coherent but factually incorrect.
We present Medico, a Multi-source evidence fusion enhanced hallucination detection and correction framework.
It fuses diverse evidence from multiple sources, detects whether the generated content contains factual errors, provides the rationale behind the judgment, and iteratively revises the hallucinated content.
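A rough sketch of such a detect-then-correct loop is below; the `llm` stub, the prompt wording, and the simple concatenation of evidence are illustrative placeholders, not Medico's actual components.

```python
from typing import Callable

def llm(prompt: str) -> str:
    # Placeholder: swap in a real LLM client.
    raise NotImplementedError("plug in an LLM client here")

def detect_and_correct(query: str, draft: str,
                       retrievers: list[Callable[[str], list[str]]],
                       max_rounds: int = 3) -> str:
    # Multi-source fusion, simplified here to pooling all retrieved snippets.
    evidence = "\n".join(s for retrieve in retrievers for s in retrieve(query))
    for _ in range(max_rounds):
        # Detection: the judge returns PASS or FAIL plus a rationale.
        verdict = llm(f"Evidence:\n{evidence}\n\nAnswer:\n{draft}\n\n"
                      "Does the answer contain factual errors? "
                      "Reply PASS or FAIL, then explain why.")
        if verdict.startswith("PASS"):
            break
        # Correction: revise the hallucinated content using the rationale.
        draft = llm(f"Evidence:\n{evidence}\n\nRationale:\n{verdict}\n\n"
                    f"Rewrite this answer so it agrees with the evidence:\n{draft}")
    return draft
```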
arXiv Detail & Related papers (2024-10-14T12:00:58Z)
- MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
We conduct a pioneering study of hallucinations in LLM-generated responses to real-world healthcare queries from patients.
We propose MedHalu, a carefully crafted first-of-its-kind medical hallucination dataset with a diverse range of health-related topics.
We also introduce the MedHaluDetect framework to evaluate the capabilities of various LLMs in detecting hallucinations.
arXiv Detail & Related papers (2024-09-29T00:09:01Z)
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z)
- Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications.
LVLMs inherit a susceptibility to hallucinations, a significant concern in high-stakes medical contexts.
We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation.
We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection.
arXiv Detail & Related papers (2024-06-14T17:14:22Z)
- HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP).
HalluDial is the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation.
The benchmark includes 4,094 dialogues with a total of 146,856 samples.
arXiv Detail & Related papers (2024-06-11T08:56:18Z)
- ANAH: Analytical Annotation of Hallucinations in Large Language Models
We present ANAH, a dataset that offers ANalytical Annotation of Hallucinations in Large Language Models.
ANAH consists of 12k sentence-level annotations for 4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline.
Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that LLM hallucinations accumulate as the answer progresses, and we use ANAH to train and evaluate hallucination annotators.
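The accumulation finding can be checked with a simple positional analysis; the sketch below assumes sentence-level boolean labels per response, an illustrative layout rather than ANAH's actual schema.

```python
from collections import defaultdict

def positional_hallucination_rate(responses: list[list[bool]]) -> dict[int, float]:
    # responses[i][j] is True if sentence j of response i is hallucinated.
    totals, hallucinated = defaultdict(int), defaultdict(int)
    for labels in responses:
        for pos, is_hallu in enumerate(labels):
            totals[pos] += 1
            hallucinated[pos] += is_hallu
    return {pos: hallucinated[pos] / totals[pos] for pos in sorted(totals)}

# If hallucinations accumulate, the rate should rise with sentence position:
print(positional_hallucination_rate([
    [False, False, True],
    [False, True, True],
]))  # {0: 0.0, 1: 0.5, 2: 1.0}
```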
arXiv Detail & Related papers (2024-05-30T17:54:40Z)
- Fine-grained Hallucination Detection and Editing for Language Models
Large language models (LMs) are prone to generating factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
- Alleviating Hallucinations of Large Language Models through Induced Hallucinations
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
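A minimal sketch of the contrastive step follows, using the standard contrastive-decoding combination of base and hallucination-induced logits; the paper's exact formulation and the construction of the induced model are not reproduced here.

```python
import torch

def icd_next_token(base_logits: torch.Tensor,
                   induced_logits: torch.Tensor,
                   alpha: float = 1.0) -> int:
    # Amplify the base model and subtract the hallucination-induced model,
    # so tokens the induced model prefers are down-weighted.
    contrasted = (1 + alpha) * base_logits - alpha * induced_logits
    return int(torch.argmax(contrasted))

# Toy 4-token vocabulary: the base model alone would pick token 1, but the
# induced model strongly prefers it, so the contrast flips to token 0.
base = torch.tensor([2.0, 2.2, 0.3, 0.1])
induced = torch.tensor([0.2, 2.5, 0.1, 0.0])
print(torch.argmax(base).item(), icd_next_token(base, induced))  # 1 0
```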
arXiv Detail & Related papers (2023-12-25T12:32:49Z)
- Evaluating Hallucinations in Chinese Large Language Models
We establish a benchmark named HalluQA (Chinese Hallucination Question-Answering) to measure the hallucination phenomenon in Chinese large language models.
We consider two types of hallucinations: imitative falsehoods and factual errors, and we construct adversarial samples based on GLM-130B and ChatGPT.
For evaluation, we design an automated evaluation method using GPT-4 to judge whether a model output is hallucinated.
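A sketch of such a GPT-4 judge, using the official OpenAI Python client, is below; the prompt template is an assumption, not HalluQA's exact one.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_hallucinated(question: str, reference: str, output: str) -> bool:
    # Ask GPT-4 for a binary hallucination verdict on a model output.
    prompt = (f"Question: {question}\n"
              f"Reference answer: {reference}\n"
              f"Model output: {output}\n"
              "Does the model output contain hallucinated content? "
              "Answer YES or NO.")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```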
arXiv Detail & Related papers (2023-10-05T07:57:09Z)
- Med-HALT: Medical Domain Hallucination Test for Large Language Models
This research paper focuses on the challenges posed by hallucinations in large language models (LLMs).
We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations.
arXiv Detail & Related papers (2023-07-28T06:43:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.