Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
- URL: http://arxiv.org/abs/2406.10185v1
- Date: Fri, 14 Jun 2024 17:14:22 GMT
- Title: Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
- Authors: Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang
- Abstract summary: Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications.
LVLMs also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts.
We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation.
We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection.
- Score: 22.30139330566514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM that employs multi-task training for precise hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that the MediHall Score provides a more nuanced understanding of hallucination impacts than traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.
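The abstract describes the MediHall Score only at a high level: a hierarchical metric that weights each detected hallucination by its type and clinical severity. Below is a minimal sketch of how such a severity-weighted score could be aggregated; the category names, weights, and the `medihall_style_score` helper are illustrative assumptions, not the paper's actual definition.

```python
# Hypothetical sketch of a severity-weighted, hierarchical hallucination score.
# The categories and weights below are illustrative assumptions; the paper
# defines the actual MediHall Score hierarchy and values.

from typing import Iterable

# Assumed hierarchy: more clinically dangerous hallucination types receive
# lower scores (a hallucination-free answer scores 1.0).
SEVERITY_WEIGHTS = {
    "catastrophic": 0.0,   # e.g., fabricated diagnosis or treatment
    "critical": 0.25,      # e.g., wrong lesion location or laterality
    "minor": 0.5,          # e.g., small attribute errors with low clinical impact
    "none": 1.0,           # no hallucination detected
}

def medihall_style_score(labels: Iterable[str]) -> float:
    """Average the per-answer severity weights over an evaluation set."""
    labels = list(labels)
    if not labels:
        raise ValueError("no predictions to score")
    return sum(SEVERITY_WEIGHTS[label] for label in labels) / len(labels)

# Example: three model answers judged by a hallucination detector.
print(medihall_style_score(["none", "minor", "catastrophic"]))  # -> 0.5
```

In the paper's setup, the per-answer labels would come from a hallucination detector such as MediHallDetector; here they are hard-coded purely for illustration.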
Related papers
- MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models [82.30696225661615]
We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection.
We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and medically fine-tuned UltraMedical, struggle with this binary hallucination detection task.
Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth.
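The abstract mentions bidirectional entailment clustering without further detail. A minimal sketch of the idea, grouping answers that mutually entail one another, is shown below; the `entails` helper is a trivial stand-in for whatever NLI model MedHallu actually uses, and the example answers are invented for illustration.

```python
# Sketch of bidirectional entailment clustering: two answers share a cluster
# only when each entails the other.

def entails(premise: str, hypothesis: str) -> bool:
    # Trivial stand-in so the sketch runs: treat case-insensitive containment
    # as entailment. Replace with a real NLI/entailment classifier.
    return hypothesis.lower() in premise.lower()

def bidirectional_entailment_clusters(answers: list[str]) -> list[list[str]]:
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]  # compare against one representative per cluster
            if entails(rep, ans) and entails(ans, rep):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no mutual entailment found: start a new cluster
    return clusters

answers = [
    "The lesion is in the left lung.",
    "the lesion is in the left lung.",
    "No abnormality is seen.",
]
print(bidirectional_entailment_clusters(answers))  # two clusters with this stand-in
```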
arXiv Detail & Related papers (2025-02-20T06:33:23Z)
- HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in performing complex multimodal tasks.
We propose HALLUCINOGEN, a novel visual question answering (VQA) object hallucination attack benchmark.
We extend our benchmark to high-stakes medical applications and introduce MED-HALLUCINOGEN, hallucination attacks tailored to the biomedical domain.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
- MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models [0.0]
Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications.
Their propensity for hallucinations presents substantial risks to patient care.
This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs.
arXiv Detail & Related papers (2024-12-25T16:51:29Z)
- Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks.
This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs.
Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z)
- MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context [21.562034852024272]
Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data.
Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets.
We introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs.
arXiv Detail & Related papers (2024-07-03T00:59:03Z)
- HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation [19.318217051269382]
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP).
HalluDial is the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation.
The benchmark includes 4,094 dialogues with a total of 146,856 samples.
arXiv Detail & Related papers (2024-06-11T08:56:18Z)
- Evaluation and Analysis of Hallucination in Large Vision-Language Models [49.19829480199372]
Large Vision-Language Models (LVLMs) have recently achieved remarkable success.
LVLMs are still plagued by the hallucination problem.
Here, hallucination refers to content in LVLMs' responses that does not exist in the visual input.
arXiv Detail & Related papers (2023-08-29T08:51:24Z)
- Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate descriptions containing objects that are inconsistent with the target images.
We propose a polling-based query method called POPE to evaluate object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
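POPE, as summarized above, evaluates object hallucination by polling the model with yes/no questions about objects that are and are not present in the image. The sketch below illustrates that polling protocol under simplifying assumptions: `ask_model` is a stub for the LVLM under test, the prompt wording is paraphrased, the example image path and object names are invented, and only random negative sampling is shown (POPE also uses popular and adversarial negatives).

```python
# Simplified sketch of a POPE-style polling evaluation: ask yes/no questions
# about objects that are and are not present in the image, then score answers.

import random

def ask_model(image_path: str, question: str) -> str:
    # Stand-in for the LVLM under evaluation; always answers "no" so the sketch
    # runs. Replace with a real model call for actual evaluation.
    return "no"

def pope_eval(image_path: str, present: list[str], vocabulary: list[str], k: int = 3) -> float:
    """Poll the model about k present and k absent objects; return accuracy."""
    absent_pool = [obj for obj in vocabulary if obj not in present]
    positives = random.sample(present, min(k, len(present)))
    negatives = random.sample(absent_pool, min(k, len(absent_pool)))  # random negatives only

    correct, total = 0, 0
    for obj, gold in [(o, "yes") for o in positives] + [(o, "no") for o in negatives]:
        answer = ask_model(image_path, f"Is there a {obj} in the image?")
        correct += int(answer.strip().lower().startswith(gold))
        total += 1
    return correct / total

# Illustrative usage with invented annotations; the stub model scores 0.5 here.
print(pope_eval(
    "chest_xray_001.png",
    present=["rib fracture", "pleural effusion"],
    vocabulary=["rib fracture", "pleural effusion", "pacemaker", "nodule", "cardiomegaly"],
    k=2,
))
```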