Med-HALT: Medical Domain Hallucination Test for Large Language Models
- URL: http://arxiv.org/abs/2307.15343v2
- Date: Sat, 14 Oct 2023 17:04:12 GMT
- Title: Med-HALT: Medical Domain Hallucination Test for Large Language Models
- Authors: Ankit Pal, Logesh Kumar Umapathi and Malaikannan Sankarasubbu
- Abstract summary: This research paper focuses on the challenges posed by hallucinations in large language models (LLMs)
We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This research paper focuses on the challenges posed by hallucinations in
large language models (LLMs), particularly in the context of the medical
domain. Hallucination, wherein these models generate plausible yet unverified
or incorrect information, can have serious consequences in healthcare
applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain
Hallucination Test), designed specifically to evaluate and reduce
hallucinations. Med-HALT provides a diverse multinational dataset derived from
medical examinations across various countries and includes multiple innovative
testing modalities. Med-HALT includes two categories of tests reasoning and
memory-based hallucination tests, designed to assess LLMs's problem-solving and
information retrieval abilities.
Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2,
MPT, and Falcon, revealing significant differences in their performance. The
paper provides detailed insights into the dataset, promoting transparency and
reproducibility. Through this work, we aim to contribute to the development of
safer and more reliable language models in healthcare. Our benchmark can be
found at medhalt.github.io
Related papers
- MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models [82.30696225661615]
We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection.
We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and medically fine-tuned UltraMedical, struggle with this binary hallucination detection task.
Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth.
arXiv Detail & Related papers (2025-02-20T06:33:23Z) - HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in performing complex multimodal tasks.
We propose HALLUCINOGEN, a novel visual question answering (VQA) object hallucination attack benchmark.
We extend our benchmark to high-stakes medical applications and introduce MED-HALLUCINOGEN, hallucination attacks tailored to the biomedical domain.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization [6.37435726278524]
We investigate how hallucinations manifest in large language models (LLMs) when summarizing topic-specific information from multiple documents.
On average, up to 75% of the content in LLM-generated summary is hallucinated, with hallucinations more likely to occur towards the end of the summaries.
To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights.
arXiv Detail & Related papers (2024-10-17T18:38:53Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text.
LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context [21.562034852024272]
Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data.
Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets.
We introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs.
arXiv Detail & Related papers (2024-07-03T00:59:03Z) - Detecting and Evaluating Medical Hallucinations in Large Vision Language Models [22.30139330566514]
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications.
LVLMs inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts.
We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation.
We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection.
arXiv Detail & Related papers (2024-06-14T17:14:22Z) - MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge [4.8004472307210255]
Large language models (LLMs) have excelled across domains, delivering notable performance on medical evaluation benchmarks.
However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios.
We develop a novel evaluation framework MultifacetEval to examine the degree and coverage of LLMs in encoding and mastering medical knowledge.
arXiv Detail & Related papers (2024-06-05T04:15:07Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
hallucinations inherent in machine-generated data remain under-explored.
We present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm.
Our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA.
arXiv Detail & Related papers (2023-11-22T04:52:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.