MedHal: An Evaluation Dataset for Medical Hallucination Detection
- URL: http://arxiv.org/abs/2504.08596v1
- Date: Fri, 11 Apr 2025 14:55:15 GMT
- Title: MedHal: An Evaluation Dataset for Medical Hallucination Detection
- Authors: Gaya Mehenni, Amal Zouaq
- Abstract summary: We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning.
- Score: 2.5782420501870296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where they can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating the development of medical AI research.
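As a concrete illustration of how such a baseline detector could be trained on MedHal-style data, here is a minimal sketch that frames detection as sequence-pair classification over a (context, statement) pair with a binary factual/hallucinated label. The field names, toy samples, label convention, and encoder choice are all illustrative assumptions, not the authors' released schema or model.

```python
# A minimal sketch, assuming MedHal-style samples pair a medical statement with
# its source context and a binary factual/hallucinated label. Field names, toy
# data, and the encoder are assumptions for illustration only.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # a clinical/biomedical encoder could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical samples: label 1 = factual, 0 = hallucinated (assumed convention).
samples = [
    {"context": "The patient was prescribed 500 mg of metformin twice daily.",
     "statement": "The patient takes metformin 500 mg twice a day.", "label": 1},
    {"context": "The patient was prescribed 500 mg of metformin twice daily.",
     "statement": "The patient was started on insulin therapy.", "label": 0},
]

def collate(batch):
    # Encode the (context, statement) pair jointly so the model can compare them.
    enc = tokenizer([s["context"] for s in batch],
                    [s["statement"] for s in batch],
                    padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor([s["label"] for s in batch])
    return enc

loader = DataLoader(samples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a real run would iterate over the full annotated dataset
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy over factual vs. hallucinated
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# At inference time, the argmax over the two logits flags a statement as
# factual or hallucinated with respect to its context.
```

Framing the task this way keeps the detector cheap to run at scale, which is the efficiency argument the abstract makes relative to costly expert review.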
Related papers
- MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models [81.64135119165277]
Hallucinations can jeopardize clinical decision-making, potentially harming diagnosis and treatment. We propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level.
arXiv Detail & Related papers (2025-02-28T06:59:49Z) - Medical Hallucinations in Foundation Models and Their Impact on Healthcare [53.97060824532454]
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. We define medical hallucination as any instance in which a model generates misleading medical content. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies.
arXiv Detail & Related papers (2025-02-26T02:30:44Z) - MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models [82.30696225661615]
We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to the ground truth (see the sketch after this list).
arXiv Detail & Related papers (2025-02-20T06:33:23Z) - MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context [21.562034852024272]
Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data.
Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets.
We introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs.
arXiv Detail & Related papers (2024-07-03T00:59:03Z) - CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation [20.59298361626719]
We propose a chain-of-medical-thought approach (CoMT) to mitigate hallucinations in medical report generation. CoMT intends to imitate the cognitive process of human doctors by decomposing diagnostic procedures.
arXiv Detail & Related papers (2024-06-17T12:03:32Z) - Detecting and Evaluating Medical Hallucinations in Large Vision Language Models [22.30139330566514]
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications.
LVLMs inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts.
We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation.
We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection.
arXiv Detail & Related papers (2024-06-14T17:14:22Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Quantity Matters: Towards Assessing and Mitigating Number Hallucination in Large Vision-Language Models [57.42800112251644]
We focus on a specific type of hallucination, number hallucination, referring to models incorrectly identifying the number of certain objects in images.
We devise a training approach aimed at improving consistency to reduce number hallucinations, which leads to an 8% enhancement in performance over direct finetuning methods.
arXiv Detail & Related papers (2024-03-03T02:31:11Z) - A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models [11.218649399559691]
Fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2.
We find that common quantitative metrics do not correlate well with faithfulness and quality.
arXiv Detail & Related papers (2024-02-23T16:32:28Z) - Med-HALT: Medical Domain Hallucination Test for Large Language Models [0.0]
This research paper focuses on the challenges posed by hallucinations in large language models (LLMs).
We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations.
arXiv Detail & Related papers (2023-07-28T06:43:04Z) - Semi-Supervised Variational Reasoning for Medical Dialogue Generation [70.838542865384]
Two key characteristics are relevant for medical dialogue generation: patient states and physician actions.
We propose an end-to-end variational reasoning approach to medical dialogue generation.
A physician policy network, composed of an action classifier and two reasoning detectors, is proposed to augment reasoning ability.
arXiv Detail & Related papers (2021-05-13T04:14:35Z)
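The bidirectional entailment clustering mentioned in the MedHallu entry above can be illustrated with a small sketch: two statements are treated as semantically equivalent when an off-the-shelf NLI model judges each to entail the other. The model choice, probability threshold, and example statements below are assumptions for illustration, not the MedHallu authors' exact pipeline.

```python
# Hedged sketch of a bidirectional entailment check with a generic MNLI model.
# Model, threshold, and examples are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large-mnli"  # assumption: any MNLI-style NLI model would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

ENTAILMENT_ID = model.config.label2id.get("ENTAILMENT", 2)

def entails(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """Return True if the NLI model judges `premise` to entail `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    return probs[ENTAILMENT_ID].item() >= threshold

def bidirectionally_entailed(a: str, b: str) -> bool:
    """Two statements cluster together if each entails the other."""
    return entails(a, b) and entails(b, a)

# Hypothetical example: a statement close to the ground truth is harder to flag.
truth = "Metformin is a first-line treatment for type 2 diabetes."
candidate = "Metformin is commonly used as an initial therapy for type 2 diabetes."
print(bidirectionally_entailed(truth, candidate))
```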