MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models
- URL: http://arxiv.org/abs/2503.02157v1
- Date: Tue, 04 Mar 2025 00:40:09 GMT
- Title: MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models
- Authors: Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao
- Abstract summary: Large Vision Language Models (LVLMs) are becoming increasingly important in the medical domain. MedHEval is a novel benchmark that systematically evaluates hallucinations and mitigation strategies in Med-LVLMs. We conduct experiments across 11 popular (Med)-LVLMs and evaluate 7 state-of-the-art hallucination mitigation techniques.
- Score: 37.78272983522441
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision Language Models (LVLMs) are becoming increasingly important in the medical domain, yet Medical LVLMs (Med-LVLMs) frequently generate hallucinations due to limited expertise and the complexity of medical applications. Existing benchmarks fail to effectively evaluate hallucinations based on their underlying causes and lack assessments of mitigation strategies. To address this gap, we introduce MedHEval, a novel benchmark that systematically evaluates hallucinations and mitigation strategies in Med-LVLMs by categorizing them into three underlying causes: visual misinterpretation, knowledge deficiency, and context misalignment. We construct a diverse set of close- and open-ended medical VQA datasets with comprehensive evaluation metrics to assess these hallucination types. We conduct extensive experiments across 11 popular (Med)-LVLMs and evaluate 7 state-of-the-art hallucination mitigation techniques. Results reveal that Med-LVLMs struggle with hallucinations arising from different causes while existing mitigation methods show limited effectiveness, especially for knowledge- and context-based errors. These findings underscore the need for improved alignment training and specialized mitigation strategies to enhance Med-LVLMs' reliability. MedHEval establishes a standardized framework for evaluating and mitigating medical hallucinations, guiding the development of more trustworthy Med-LVLMs.
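The paper itself does not include code here, but as a rough illustration of how a close-ended (yes/no) medical VQA benchmark of this kind could be scored by hallucination cause, the following Python sketch groups items by an annotated cause label (visual misinterpretation, knowledge deficiency, context misalignment) and reports per-cause accuracy. All field names and the data layout are assumptions for illustration, not MedHEval's actual format or API.

```python
# Hypothetical sketch: score close-ended (yes/no) medical VQA items grouped by
# the annotated hallucination cause. Field names are assumptions, not MedHEval's.
from collections import defaultdict

def per_cause_accuracy(items):
    """items: iterable of dicts with 'cause', 'answer', and 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        cause = item["cause"]  # e.g. "visual", "knowledge", "context"
        total[cause] += 1
        if item["prediction"].strip().lower() == item["answer"].strip().lower():
            correct[cause] += 1
    return {cause: correct[cause] / total[cause] for cause in total}

# Toy usage
toy_items = [
    {"cause": "visual", "answer": "yes", "prediction": "Yes"},
    {"cause": "knowledge", "answer": "no", "prediction": "yes"},
    {"cause": "context", "answer": "no", "prediction": "no"},
]
print(per_cause_accuracy(toy_items))  # {'visual': 1.0, 'knowledge': 0.0, 'context': 1.0}
```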
Related papers
- MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models [81.64135119165277]
Hallucinations can jeopardize clinical decision-making, potentially harming diagnosis and treatment. We propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level.
arXiv Detail & Related papers (2025-02-28T06:59:49Z) - HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on complex multimodal tasks. We propose HALLUCINOGEN, a novel visual question answering (VQA) object hallucination attack benchmark. We extend our benchmark to high-stakes medical applications and introduce MED-HALLUCINOGEN, hallucination attacks tailored to the biomedical domain.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks.
This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs.
Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z) - A Survey of Hallucination in Large Visual Language Models [48.794850395309076]
The existence of hallucinations has limited the potential and practical effectiveness of LVLMs in various fields.
The structure of LVLMs and main causes of hallucination generation are introduced.
The available hallucination evaluation benchmarks for LVLMs are presented.
arXiv Detail & Related papers (2024-10-20T10:58:58Z) - Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering [6.087954428369633]
We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. Tested on the MIMIC-CXR-JPG and CheXpert datasets, our methods significantly improve the diagnostic F1 score. Based on POPE metrics, they effectively suppress the false negative predictions of existing LVLMs and improve Recall by approximately 0.07.
arXiv Detail & Related papers (2024-07-31T06:34:38Z) - MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context [21.562034852024272]
Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data.
Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets.
We introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs.
arXiv Detail & Related papers (2024-07-03T00:59:03Z) - Detecting and Evaluating Medical Hallucinations in Large Vision Language Models [22.30139330566514]
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications.
LVLMs inherit a susceptibility to hallucinations, a significant concern in high-stakes medical contexts.
We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation.
We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection.
arXiv Detail & Related papers (2024-06-14T17:14:22Z) - Hallucination of Multimodal Large Language Models: A Survey [40.73148186369018]
Multimodal large language models (MLLMs) have demonstrated significant advancements and remarkable abilities in multimodal tasks.
Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content.
This survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field.
arXiv Detail & Related papers (2024-04-29T17:59:41Z) - Evaluation and Analysis of Hallucination in Large Vision-Language Models [49.19829480199372]
Large Vision-Language Models (LVLMs) have recently achieved remarkable success.
LVLMs are still plagued by the hallucination problem.
Hallucination refers to the information of LVLMs' responses that does not exist in the visual input.
arXiv Detail & Related papers (2023-08-29T08:51:24Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate object hallucination (a minimal sketch of this polling-style evaluation follows this entry).
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
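POPE-style evaluation, as summarized in the last entry above, polls a model with yes/no questions about object presence and scores the answers against ground truth. The sketch below is an illustration of that polling idea only, not the authors' released implementation; `ask_model` is a hypothetical placeholder for any LVLM interface that returns a textual answer.

```python
# Minimal sketch of a POPE-style polling evaluation: ask yes/no questions about
# object presence and compute accuracy/precision/recall/F1 against ground truth.
# `ask_model` is a placeholder callable, not a real library API.

def pope_scores(image, objects_with_labels, ask_model):
    """objects_with_labels: list of (object_name, is_present) pairs."""
    tp = fp = tn = fn = 0
    for obj, present in objects_with_labels:
        answer = ask_model(image, f"Is there a {obj} in the image?").strip().lower()
        said_yes = answer.startswith("yes")
        if said_yes and present:
            tp += 1
        elif said_yes and not present:
            fp += 1
        elif not said_yes and present:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / max(tp + fp + tn + fn, 1)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy usage with a dummy model that always answers "yes":
scores = pope_scores("chest_xray.png",
                     [("rib fracture", True), ("pacemaker", False)],
                     lambda img, q: "yes")
print(scores)  # accuracy 0.5, precision 0.5, recall 1.0, f1 ~0.67
```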